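Introduction: The "Environment Hell" of Deep Learning Training

In deep learning research and development, scenarios like these come up all the time:

"This model runs fine on my machine, so why does it fail on yours?"

"The CUDA version doesn't match, so I have to reinstall the whole environment again..."

"This library conflicts with that one, and after half a day I still haven't sorted it out..."

The root cause is that deep learning training is extremely sensitive to its environment. A different CUDA version, Python package version, or system library version can each make a training run fail. Docker containerization is one of the most effective ways to address this pain point.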

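Chapter 1: Four Pain Points of Training Directly on the Host

1.1 Complex Environment Dependencies

A deep learning training environment has many dependencies, including:

  • CUDA toolkit: the version must be compatible with the GPU driver

  • cuDNN: the version must match the CUDA version

  • Python: a specific version of the interpreter

  • Deep learning frameworks: PyTorch, TensorFlow, and so on

  • Data processing libraries: OpenCV, Pillow, Pandas, and so on

  • System components: libc, gcc, and other base libraries and tools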

$ nvcc --version
CUDA Version 11.7

$ python -c "import torch; print(torch.version.cuda)"
11.6  # Does not match the 11.7 toolkit above!

$ python train.py
RuntimeError: CUDA error: no kernel image is available for execution on the device
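
The comparison in the transcript above can be automated. Below is a minimal sketch, assuming the two version strings have already been collected (for example from `nvcc --version` and `torch.version.cuda`). The helper names `major_minor` and `cuda_versions_match` are hypothetical and not part of any library. The sketch only flags a difference in the major.minor release; whether a given mismatch actually breaks training depends on the setup.

```python
import re


def major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '11.7' or '11.8.0'."""
    match = re.search(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"no version number found in {version!r}")
    return int(match.group(1)), int(match.group(2))


def cuda_versions_match(toolkit_version: str, torch_cuda_version: str) -> bool:
    """Return True when both strings name the same CUDA major.minor release."""
    return major_minor(toolkit_version) == major_minor(torch_cuda_version)


if __name__ == "__main__":
    # Values taken from the transcript above.
    print(cuda_versions_match("11.7", "11.6"))  # False
```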

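1.2 Difficult Isolation Between Projects

When several projects run on the same machine, environment conflicts become especially obvious:

# Project A requires
torch==1.9.0
torchvision==0.10.0
cudatoolkit=11.1

# Project B requires
torch==1.13.0
torchvision==0.14.0
cudatoolkit=11.6

# Both sets cannot be installed into the same environment!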
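
To see such conflicts before installing anything, the two pin lists can be compared programmatically. This is a rough sketch rather than a full requirements parser: `parse_pins` and `find_conflicts` are hypothetical helpers, and only exact pins written as `name==version` (pip style) or `name=version` (conda style) are recognized.

```python
import re

# Matches "name==version" (pip style) and "name=version" (conda style).
PIN_PATTERN = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*==?\s*([^\s#]+)")


def parse_pins(lines):
    """Return {package: version} for every exactly pinned line; other lines are skipped."""
    pins = {}
    for line in lines:
        match = PIN_PATTERN.match(line)
        if match:
            pins[match.group(1).lower()] = match.group(2)
    return pins


def find_conflicts(pins_a, pins_b):
    """Return {package: (version_a, version_b)} for packages pinned differently."""
    return {
        name: (pins_a[name], pins_b[name])
        for name in sorted(pins_a.keys() & pins_b.keys())
        if pins_a[name] != pins_b[name]
    }


if __name__ == "__main__":
    project_a = ["torch==1.9.0", "torchvision==0.10.0", "cudatoolkit=11.1"]
    project_b = ["torch==1.13.0", "torchvision==0.14.0", "cudatoolkit=11.6"]
    conflicts = find_conflicts(parse_pins(project_a), parse_pins(project_b))
    for name, (a, b) in conflicts.items():
        print(f"{name}: {a} vs {b}")
```

With the two projects above, all three packages are reported, which is why they cannot share one environment.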

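1.3 Reproducing an Environment Is Nearly Impossible

Suppose we want to retrain or fine-tune a model that was trained three months ago:

# Try to reproduce the old environment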
$ pip install -r requirements.txt
ERROR: Could not find a version that satisfies the requirement torch==1.7.0
ERROR: No matching distribution found for torch==1.7.0
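
One way to make a later reproduction attempt less of a guessing game is to record the environment at training time. The sketch below uses only the standard library; `snapshot_environment` is a hypothetical helper, and the snapshot covers Python packages only, not CUDA, drivers, or system libraries.

```python
import json
import platform
import sys
from importlib import metadata


def snapshot_environment():
    """Return a dict describing the interpreter, platform, and installed packages."""
    packages = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:  # skip broken installs with incomplete metadata
            packages.add(f"{name}=={version}")
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": sorted(packages),
    }


if __name__ == "__main__":
    # Store this next to the model checkpoint so the environment can be compared later.
    print(json.dumps(snapshot_environment(), indent=2))
```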

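1.4 Inefficient Team Collaboration

When a new member joins the team, environment setup becomes the first obstacle:

New hire Xiao Wang's first week:
Day 1: Installs CUDA, picks the wrong version, reinstalls
Day 2: Sets up Python, runs into virtual environment conflicts
Day 3: Installs PyTorch, which turns out to be incompatible with the installed CUDA
Day 4: Fights version problems across assorted dependencies
Day 5: Finally gets the code to run...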

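Chapter 2: Core Advantages of the Docker Approach

2.1 Guaranteed Environment Consistency

Docker uses container technology to provide environment isolation and consistency:

# Dockerfile
FROM pytorch/pytorch:2.1.0-cuda11.8-cudnn8-devel

# Copy the dependency list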
COPY requirements.txt .
RUN pip install -r requirements.txt

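2.2 Fast Environment Replication

Build once, run anywhere:

# Build the image
docker build -t dl-training:1.0 .

# Run on any machine that has Docker, an NVIDIA driver, and the NVIDIA Container Toolkit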
docker run --gpus all dl-training:1.0 python train.py
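
When training jobs are launched from scripts, it can help to assemble the `docker run` arguments in one place. This is a minimal sketch under that assumption: `docker_run_command` is a hypothetical helper that only builds the argument list and executes nothing. The flags it emits (`--gpus`, `--name`, `-v`) are the same ones used elsewhere in this article.

```python
import shlex


def docker_run_command(image, command, gpus="all", volumes=None, name=None):
    """Assemble the argv list for `docker run`; nothing is executed here."""
    argv = ["docker", "run"]
    if gpus:
        argv += ["--gpus", gpus]
    if name:
        argv += ["--name", name]
    for host_path, container_path in (volumes or {}).items():
        argv += ["-v", f"{host_path}:{container_path}"]
    argv.append(image)
    argv += list(command)
    return argv


if __name__ == "__main__":
    argv = docker_run_command("dl-training:1.0", ["python", "train.py"])
    print(shlex.join(argv))  # docker run --gpus all dl-training:1.0 python train.py
    # To actually launch it: subprocess.run(argv, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems when paths contain spaces.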

2.3 Version Management and Rollback

# Manage multiple image versions
docker tag dl-training:1.0 myregistry.com/dl-training:v1.0
docker tag dl-training:1.1 myregistry.com/dl-training:v1.1

# Roll back quickly to an older version
docker run --gpus all myregistry.com/dl-training:v1.0 python train.py
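
Once the number of tags grows, picking the rollback target by eye gets error-prone, because tags sort alphabetically rather than numerically (`v1.10` sorts before `v1.2` as text). The sketch below assumes tags follow the `vMAJOR.MINOR` pattern used above; `version_key` and `rollback_target` are hypothetical helpers.

```python
import re


def version_key(tag):
    """Turn 'v1.10' into (1, 10) so tags compare numerically."""
    return tuple(int(part) for part in re.findall(r"\d+", tag))


def rollback_target(tags, current):
    """Return the newest tag that is older than `current`, or None if there is none."""
    older = [t for t in tags if version_key(t) < version_key(current)]
    return max(older, key=version_key, default=None)


if __name__ == "__main__":
    tags = ["v1.0", "v1.1", "v1.2", "v1.10"]
    print(rollback_target(tags, "v1.10"))  # v1.2
    print(rollback_target(tags, "v1.0"))   # None
```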

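Chapter 3: Building a Docker Image for Deep Learning Training

3.1 Method 1: Build the Full Environment from Scratch

Suited to projects with well-defined requirements. It gives the most control over the environment, but the process is tedious, so it fits large projects best.

3.2 Method 2: Start from an Official Image and Refine It Incrementally

This method is particularly well suited to quick validation and iterative development, so the rest of this article focuses on it:

# Start a container from the official PyTorch base image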
docker run -it --gpus all --name pytorch-dev \
  -v $(pwd):/workspace \
  pytorch/pytorch:2.1.0-cuda11.8-cudnn8-devel \
  /bin/bash

Once inside the container, we can start working right away:

# Check the environment inside the container
root@4510ca084efc:/workspace# python -c "import torch; print(f'PyTorch version: {torch.__version__}')"
PyTorch version: 2.1.0

root@4510ca084efc:/workspace# python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
CUDA available: True

# If CUDA is available, check the GPU information
root@4510ca084efc:/workspace# python -c "import torch; print(f'GPU count: {torch.cuda.device_count()}')"
GPU count: 2

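Chapter 4: Building the Environment Incrementally from an Official Image

4.1 Start the Base Container and Test It

mask-rcnn is an object detection demo. We mount its code into the container for testing.

The code is available as sample code on the "AIgate-Kubernetes作业" (Kubernetes Jobs) page.

# Start the base PyTorch container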
docker run -itd --gpus all --name dl-workspace \
  -v $(pwd)/code:/workspace/code \
  pytorch/pytorch:2.1.0-cuda11.8-cudnn8-devel

root@gpu-3090-vm09:~# docker run -itd --gpus all --name dl-workspace   -v /home/mask-rcnn/code:/workspace/code   pytorch/pytorch:2.1.0-cuda11.8-cudnn8-devel
fa9d345b6e20273a553a847df4c81ace91f9166b55d2e64f70552ae578d2d8a1
root@gpu-3090-vm09:~# docker ps -a
CONTAINER ID   IMAGE                                               COMMAND                  CREATED         STATUS                     PORTS     NAMES
fa9d345b6e20   pytorch/pytorch:2.1.0-cuda11.8-cudnn8-devel         "/opt/nvidia/nvidia_…"   3 seconds ago   Up 3 seconds                         dl-workspace

# Enter the container
docker exec -it dl-workspace bash

root@gpu-3090-vm09:~# docker exec -it dl-workspace bash
root@f37775dceec2:/workspace/code/mask-rcnn/mask-rcnn# ll
total 92
drwxr-xr-x 5 root root  4096 Oct 14 06:02 ./
drwxr-xr-x 3 root root  4096 Oct 14 06:02 ../
-rw-r--r-- 1 root root 36690 Oct 14 06:02 1.jpg
drwxr-xr-x 5 root root  4096 Oct 14 06:02 PennFudanPed/
-rw-r--r-- 1 root root  4724 Oct 14 06:02 README.md
drwxr-xr-x 3 root root  4096 Oct 14 06:02 detection/
-rw-r--r-- 1 root root  2526 Oct 14 06:02 predict.py
-rw-r--r-- 1 root root  2301 Oct 14 06:02 predict_vision.py
-rw-r--r-- 1 root root   105 Oct 14 06:02 requirements.txt
-rw-r--r-- 1 root root    69 Oct 14 06:02 run.sh
drwxr-xr-x 2 root root  4096 Oct 14 06:02 src/
-rw-r--r-- 1 root root  4036 Oct 14 06:02 test_io.py
-rw-r--r-- 1 root root  7387 Oct 14 06:02 train.py

Inside the container, we first verify the base environment:

# basic.py
import torch
import torchvision

print("=== Basic Environment Check ===")
print(f"PyTorch Version: {torch.__version__}")
print(f"Torchvision Version: {torchvision.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA Version: {torch.version.cuda}")
    print(f"Number of GPUs: {torch.cuda.device_count()}")
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")

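Result of running it (the base image does not ship vim, so we install it first in order to create the file):

root@fa9d345b6e20:/workspace# vim basic.py
bash: vim: command not found
root@fa9d345b6e20:/workspace# apt update && apt install vim -y

# After the installation finishes, create basic.py and run the check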
root@fa9d345b6e20:/workspace# python basic.py 
=== Basic Environment Check ===
PyTorch Version: 2.1.0
Torchvision Version: 0.16.0
CUDA Available: True
CUDA Version: 11.8
Number of GPUs: 2
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090
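
The same kind of check can be made tolerant of packages that are missing or fail to import, which is useful on a fresh image where not everything is installed yet. This is a sketch, not the demo's own tooling: `module_version` and `environment_report` are hypothetical helpers, and the default module list is an assumption about what this project needs.

```python
import importlib
import importlib.util
import json
import platform


def module_version(name):
    """Return the module's version, None if not installed, or an error note if the import fails."""
    if importlib.util.find_spec(name) is None:
        return None
    try:
        module = importlib.import_module(name)
    except ImportError as exc:  # installed but unusable, e.g. a missing shared library
        return f"import failed: {exc}"
    return getattr(module, "__version__", "unknown")


def environment_report(modules=("torch", "torchvision", "cv2")):
    """Collect interpreter and package versions into a JSON-serializable dict."""
    report = {"python": platform.python_version()}
    for name in modules:
        report[name] = module_version(name)
    return report


if __name__ == "__main__":
    print(json.dumps(environment_report(), indent=2))
```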

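4.2 Install Missing Dependencies Step by Step

During development we may find that some packages are missing. Install the dependencies listed in the sample code's requirements.txt: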

pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
root@f37775dceec2:/workspace/code/mask-rcnn/mask-rcnn# pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting pycocotools==2.0.8 (from -r requirements.txt (line 4))
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/6a/03/6c0bf810a5df7876caaf11f5b113e7ffd4b2fa9767d360489c6fdcefe8e5/pycocotools-2.0.8-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (427 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 427.8/427.8 kB 5.9 MB/s eta 0:00:00
Collecting pillow==11.2.1 (from -r requirements.txt (line 5))
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/f3/5e/7ca9c815ade5fdca18853db86d812f2f188212792780208bdb37a0a6aef4/pillow-11.2.1-cp310-cp310-manylinux_2_28_x86_64.whl (4.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.6/4.6 MB 42.1 MB/s eta 0:00:00
Collecting opencv-python==4.11.0.86 (from -r requirements.txt (line 6))
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/2c/8b/90eb44a40476fa0e71e05a0283947cfd74a5d36121a11d926ad6f3193cc4/opencv_python-4.11.0.86-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (63.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.0/63.0 MB 25.8 MB/s eta 0:00:00
Collecting matplotlib>=2.1.0 (from pycocotools==2.0.8->-r requirements.txt (line 4))
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/e2/3c/5692a2d9a5ba848fda3f48d2b607037df96460b941a59ef236404b39776b/matplotlib-3.10.7-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (8.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.7/8.7 MB 57.7 MB/s eta 0:00:00
Requirement already satisfied: numpy in /opt/conda/lib/python3.10/site-packages (from pycocotools==2.0.8->-r requirements.txt (line 4)) (1.26.0)
Collecting contourpy>=1.0.1 (from matplotlib>=2.1.0->pycocotools==2.0.8->-r requirements.txt (line 4))
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/32/5c/1ee32d1c7956923202f00cf8d2a14a62ed7517bdc0ee1e55301227fc273c/contourpy-1.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (325 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 325.0/325.0 kB 64.0 MB/s eta 0:00:00
Collecting cycler>=0.10 (from matplotlib>=2.1.0->pycocotools==2.0.8->-r requirements.txt (line 4))
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/e7/05/c19819d5e3d95294a6f5947fb9b9629efb316b96de511b418c53d245aae6/cycler-0.12.1-py3-none-any.whl (8.3 kB)
Collecting fonttools>=4.22.0 (from matplotlib>=2.1.0->pycocotools==2.0.8->-r requirements.txt (line 4))
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/ce/20/9b2b4051b6ec6689480787d506b5003f72648f50972a92d04527a456192c/fonttools-4.60.1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (4.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.8/4.8 MB 47.6 MB/s eta 0:00:00
Collecting kiwisolver>=1.3.1 (from matplotlib>=2.1.0->pycocotools==2.0.8->-r requirements.txt (line 4))
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/d4/42/0f333164e6307a0687d1eb9ad256215aae2f4bd5d28f4653d6cd319a3ba3/kiwisolver-1.4.9-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 79.8 MB/s eta 0:00:00
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.10/site-packages (from matplotlib>=2.1.0->pycocotools==2.0.8->-r requirements.txt (line 4)) (23.1)
Collecting pyparsing>=3 (from matplotlib>=2.1.0->pycocotools==2.0.8->-r requirements.txt (line 4))
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/10/5e/1aa9a93198c6b64513c9d7752de7422c06402de6600a8767da1524f9570b/pyparsing-3.2.5-py3-none-any.whl (113 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 113.9/113.9 kB 35.2 MB/s eta 0:00:00
Collecting python-dateutil>=2.7 (from matplotlib>=2.1.0->pycocotools==2.0.8->-r requirements.txt (line 4))
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/ec/57/56b9bcc3c9c6a792fcbaf139543cee77261f3651ca9da0c93f5c1221264b/python_dateutil-2.9.0.post0-py2.py3-none-any.whl (229 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 229.9/229.9 kB 62.1 MB/s eta 0:00:00
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.10/site-packages (from python-dateutil>=2.7->matplotlib>=2.1.0->pycocotools==2.0.8->-r requirements.txt (line 4)) (1.16.0)
Installing collected packages: python-dateutil, pyparsing, pillow, opencv-python, kiwisolver, fonttools, cycler, contourpy, matplotlib, pycocotools
  Attempting uninstall: pillow
    Found existing installation: Pillow 9.4.0
    Uninstalling Pillow-9.4.0:
      Successfully uninstalled Pillow-9.4.0
Successfully installed contourpy-1.3.2 cycler-0.12.1 fonttools-4.60.1 kiwisolver-1.4.9 matplotlib-3.10.7 opencv-python-4.11.0.86 pillow-11.2.1 pycocotools-2.0.8 pyparsing-3.2.5 python-dateutil-2.9.0.post0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

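Edit run.sh. The value of --nproc_per_node is the number of worker processes torchrun starts on this node, normally one per GPU in use:

# --nproc_per_node=1 starts one worker process, i.e. trains on one GPU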
pip install -r requirements.txt
torchrun --nproc_per_node=1 train.py
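
For each worker it starts, torchrun exports the environment variables `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`, and a training script typically reads them to decide which GPU to use. The sketch below shows one way to do that; `DistributedContext` and `read_distributed_context` are hypothetical names, and the demo's own train.py may handle this differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistributedContext:
    rank: int        # global index of this process across all nodes
    local_rank: int  # index of this process on the current node
    world_size: int  # total number of processes in the job


def read_distributed_context(environ=os.environ):
    """Read the variables torchrun exports; fall back to a single-process setup."""
    return DistributedContext(
        rank=int(environ.get("RANK", 0)),
        local_rank=int(environ.get("LOCAL_RANK", 0)),
        world_size=int(environ.get("WORLD_SIZE", 1)),
    )


if __name__ == "__main__":
    ctx = read_distributed_context()
    # In a real script, local_rank would typically select the device, e.g. f"cuda:{ctx.local_rank}".
    print(ctx)
```

With `--nproc_per_node=2` on the two-GPU machine used here, two workers would start with local ranks 0 and 1 and a world size of 2.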

Run run.sh:

sh run.sh
root@f37775dceec2:/workspace/code/mask-rcnn/mask-rcnn# sh run.sh 
Requirement already satisfied: pycocotools==2.0.8 in /opt/conda/lib/python3.10/site-packages (from -r requirements.txt (line 4)) (2.0.8)
Requirement already satisfied: pillow==11.2.1 in /opt/conda/lib/python3.10/site-packages (from -r requirements.txt (line 5)) (11.2.1)
Requirement already satisfied: opencv-python==4.11.0.86 in /opt/conda/lib/python3.10/site-packages (from -r requirements.txt (line 6)) (4.11.0.86)
Requirement already satisfied: matplotlib>=2.1.0 in /opt/conda/lib/python3.10/site-packages (from pycocotools==2.0.8->-r requirements.txt (line 4)) (3.10.7)
Requirement already satisfied: numpy in /opt/conda/lib/python3.10/site-packages (from pycocotools==2.0.8->-r requirements.txt (line 4)) (1.26.0)
Requirement already satisfied: contourpy>=1.0.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib>=2.1.0->pycocotools==2.0.8->-r requirements.txt (line 4)) (1.3.2)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.10/site-packages (from matplotlib>=2.1.0->pycocotools==2.0.8->-r requirements.txt (line 4)) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /opt/conda/lib/python3.10/site-packages (from matplotlib>=2.1.0->pycocotools==2.0.8->-r requirements.txt (line 4)) (4.60.1)
Requirement already satisfied: kiwisolver>=1.3.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib>=2.1.0->pycocotools==2.0.8->-r requirements.txt (line 4)) (1.4.9)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.10/site-packages (from matplotlib>=2.1.0->pycocotools==2.0.8->-r requirements.txt (line 4)) (23.1)
Requirement already satisfied: pyparsing>=3 in /opt/conda/lib/python3.10/site-packages (from matplotlib>=2.1.0->pycocotools==2.0.8->-r requirements.txt (line 4)) (3.2.5)
Requirement already satisfied: python-dateutil>=2.7 in /opt/conda/lib/python3.10/site-packages (from matplotlib>=2.1.0->pycocotools==2.0.8->-r requirements.txt (line 4)) (2.9.0.post0)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.10/site-packages (from python-dateutil>=2.7->matplotlib>=2.1.0->pycocotools==2.0.8->-r requirements.txt (line 4)) (1.16.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Traceback (most recent call last):
  File "/workspace/code/mask-rcnn/mask-rcnn/train.py", line 16, in <module>
    from detection.engine import train_one_epoch, evaluate
  File "/workspace/code/mask-rcnn/mask-rcnn/detection/engine.py", line 7, in <module>
    import detection.utils as utils
  File "/workspace/code/mask-rcnn/mask-rcnn/detection/utils.py", line 10, in <module>
    import cv2
  File "/opt/conda/lib/python3.10/site-packages/cv2/__init__.py", line 181, in <module>
    bootstrap()
  File "/opt/conda/lib/python3.10/site-packages/cv2/__init__.py", line 153, in bootstrap
    native_module = importlib.import_module("cv2")
  File "/opt/conda/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
ImportError: libGL.so.1: cannot open shared object file: No such file or directory
[2025-10-14 06:12:55,298] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 710) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0', 'console_scripts', 'torchrun')())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-10-14_06:12:55
  host      : f37775dceec2
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 710)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

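The traceback shows that importing cv2 fails because libGL.so.1 is missing. Install that library together with the other system libraries OpenCV commonly needs: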

root@f37775dceec2:/workspace/code/mask-rcnn/mask-rcnn# apt-get update && apt-get install -y \
>     libgl1-mesa-glx \
>     libglib2.0-0 \
>     libsm6 \
>     libxrender-dev \
>     libxext6

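# Note: the installation pauses twice for interactive input (tzdata asks for the geographic area and the time zone); setting DEBIAN_FRONTEND=noninteractive, as the Dockerfile in section 4.4 does, avoids these prompts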
Please select the geographic area in which you live. Subsequent configuration questions will narrow this down by presenting a list of cities, representing the time zones in which they are located.

  1. Africa  2. America  3. Antarctica  4. Australia  5. Arctic  6. Asia  7. Atlantic  8. Europe  9. Indian  10. Pacific  11. SystemV  12. US  13. Etc  14. Legacy
Geographic area: 6

Please select the city or region corresponding to your time zone.

  1. Aden    5. Aqtau     9. Baghdad   13. Barnaul  17. Chita       21. Damascus  25. Dushanbe   29. Hebron       33. Irkutsk   37. Jerusalem  41. Kashgar    45. Krasnoyarsk   49. Macau     53. Muscat        57. Omsk        61. Pyongyang  65. Rangoon    69. Seoul          73. Taipei    77. Tel_Aviv  81. Ujung_Pandang  85. Vientiane    89. Yekaterinburg
  2. Almaty  6. Aqtobe    10. Bahrain  14. Beirut   18. Choibalsan  22. Dhaka     26. Famagusta  30. Ho_Chi_Minh  34. Istanbul  38. Kabul      42. Kathmandu  46. Kuala_Lumpur  50. Magadan   54. Nicosia       58. Oral        62. Qatar      66. Riyadh     70. Shanghai       74. Tashkent  78. Thimphu   82. Ulaanbaatar    86. Vladivostok  90. Yerevan
  3. Amman   7. Ashgabat  11. Baku     15. Bishkek  19. Chongqing   23. Dili      27. Gaza       31. Hong_Kong    35. Jakarta   39. Kamchatka  43. Khandyga   47. Kuching       51. Makassar  55. Novokuznetsk  59. Phnom_Penh  63. Qostanay   67. Sakhalin   71. Singapore      75. Tbilisi   79. Tokyo     83. Urumqi         87. Yakutsk
  4. Anadyr  8. Atyrau    12. Bangkok  16. Brunei   20. Colombo     24. Dubai     28. Harbin     32. Hovd         36. Jayapura  40. Karachi    44. Kolkata    48. Kuwait        52. Manila    56. Novosibirsk   60. Pontianak   64. Qyzylorda  68. Samarkand  72. Srednekolymsk  76. Tehran    80. Tomsk     84. Ust-Nera       88. Yangon
Time zone: 70


Current default time zone: 'Asia/Shanghai'
Local time is now:      Tue Oct 14 14:21:45 CST 2025.
Universal Time is now:  Tue Oct 14 06:21:45 UTC 2025.
Run 'dpkg-reconfigure tzdata' if you wish to change it.
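
Before rerunning the training script, a quick preflight check can confirm that the shared libraries now load. This is a minimal sketch using `ctypes` from the standard library; `missing_shared_libraries` is a hypothetical helper, and apart from libGL.so.1 (named in the ImportError) the library list is an assumption about what OpenCV may need on this image.

```python
import ctypes


def missing_shared_libraries(names):
    """Return the subset of `names` that the dynamic loader cannot open."""
    missing = []
    for name in names:
        try:
            ctypes.CDLL(name)
        except OSError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # libGL.so.1 comes from the traceback; libgthread-2.0.so.0 is an assumed
    # second candidate (it is provided by libglib2.0-0).
    wanted = ["libGL.so.1", "libgthread-2.0.so.0"]
    print("missing:", missing_shared_libraries(wanted))
```

An empty list means the loader found every library, so `import cv2` should no longer fail for this reason.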

Run run.sh again:

root@f37775dceec2:/workspace/code/mask-rcnn/mask-rcnn# sh run.sh
Requirement already satisfied: pycocotools==2.0.8 in /opt/conda/lib/python3.10/site-packages (from -r requirements.txt (line 4)) (2.0.8)
Requirement already satisfied: pillow==11.2.1 in /opt/conda/lib/python3.10/site-packages (from -r requirements.txt (line 5)) (11.2.1)
Requirement already satisfied: opencv-python==4.11.0.86 in /opt/conda/lib/python3.10/site-packages (from -r requirements.txt (line 6)) (4.11.0.86)
Requirement already satisfied: matplotlib>=2.1.0 in /opt/conda/lib/python3.10/site-packages (from pycocotools==2.0.8->-r requirements.txt (line 4)) (3.10.7)
Requirement already satisfied: numpy in /opt/conda/lib/python3.10/site-packages (from pycocotools==2.0.8->-r requirements.txt (line 4)) (1.26.0)
Requirement already satisfied: contourpy>=1.0.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib>=2.1.0->pycocotools==2.0.8->-r requirements.txt (line 4)) (1.3.2)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.10/site-packages (from matplotlib>=2.1.0->pycocotools==2.0.8->-r requirements.txt (line 4)) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /opt/conda/lib/python3.10/site-packages (from matplotlib>=2.1.0->pycocotools==2.0.8->-r requirements.txt (line 4)) (4.60.1)
Requirement already satisfied: kiwisolver>=1.3.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib>=2.1.0->pycocotools==2.0.8->-r requirements.txt (line 4)) (1.4.9)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.10/site-packages (from matplotlib>=2.1.0->pycocotools==2.0.8->-r requirements.txt (line 4)) (23.1)
Requirement already satisfied: pyparsing>=3 in /opt/conda/lib/python3.10/site-packages (from matplotlib>=2.1.0->pycocotools==2.0.8->-r requirements.txt (line 4)) (3.2.5)
Requirement already satisfied: python-dateutil>=2.7 in /opt/conda/lib/python3.10/site-packages (from matplotlib>=2.1.0->pycocotools==2.0.8->-r requirements.txt (line 4)) (2.9.0.post0)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.10/site-packages (from python-dateutil>=2.7->matplotlib>=2.1.0->pycocotools==2.0.8->-r requirements.txt (line 4)) (1.16.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
/opt/conda/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/opt/conda/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=None`.
  warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /root/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97.8M/97.8M [00:03<00:00, 32.7MB/s]
Epoch: [0]  [   0/1675]  eta: 0:37:09  lr: 0.000010  loss: 2.8566 (2.8566)  loss_classifier: 0.6485 (0.6485)  loss_box_reg: 0.0054 (0.0054)  loss_mask: 1.5063 (1.5063)  loss_objectness: 0.6860 (0.6860)  loss_rpn_box_reg: 0.0104 (0.0104)  time: 1.3313  data: 0.1840  max mem: 2033
Epoch: [0]  [  10/1675]  eta: 0:11:44  lr: 0.000060  loss: 3.2466 (3.1440)  loss_classifier: 0.6431 (0.6380)  loss_box_reg: 0.0054 (0.0056)  loss_mask: 1.8846 (1.7880)  loss_objectness: 0.6875 (0.6874)  loss_rpn_box_reg: 0.0228 (0.0250)  time: 0.4231  data: 0.0169  max mem: 2599
Epoch: [0]  [  20/1675]  eta: 0:09:28  lr: 0.000110  loss: 2.2982 (2.6655)  loss_classifier: 0.6100 (0.6040)  loss_box_reg: 0.0023 (0.0054)  loss_mask: 0.9670 (1.3446)  loss_objectness: 0.6862 (0.6866)  loss_rpn_box_reg: 0.0212 (0.0249)  time: 0.2938  data: 0.0001  max mem: 2599
Epoch: [0]  [  30/1675]  eta: 0:08:56  lr: 0.000160  loss: 2.0591 (2.4382)  loss_classifier: 0.5015 (0.5535)  loss_box_reg: 0.0001 (0.0040)  loss_mask: 0.8265 (1.1714)  loss_objectness: 0.6840 (0.6851)  loss_rpn_box_reg: 0.0200 (0.0241)  time: 0.2729  data: 0.0001  max mem: 2990
Epoch: [0]  [  40/1675]  eta: 0:08:24  lr: 0.000210  loss: 1.8016 (2.2649)  loss_classifier: 0.3732 (0.5003)  loss_box_reg: 0.0001 (0.0040)  loss_mask: 0.7222 (1.0557)  loss_objectness: 0.6784 (0.6825)  loss_rpn_box_reg: 0.0171 (0.0224)  time: 0.2718  data: 0.0001  max mem: 2990
Epoch: [0]  [  50/1675]  eta: 0:07:51  lr: 0.000260  loss: 1.6472 (2.1363)  loss_classifier: 0.2881 (0.4533)  loss_box_reg: 0.0050 (0.0049)  loss_mask: 0.6812 (0.9765)  loss_objectness: 0.6696 (0.6789)  loss_rpn_box_reg: 0.0138 (0.0228)  time: 0.2343  data: 0.0001  max mem: 2990
Epoch: [0]  [  60/1675]  eta: 0:07:23  lr: 0.000310  loss: 1.5140 (2.0220)  loss_classifier: 0.2310 (0.4106)  loss_box_reg: 0.0158 (0.0079)  loss_mask: 0.5940 (0.9087)  loss_objectness: 0.6548 (0.6725)  loss_rpn_box_reg: 0.0174 (0.0224)  time: 0.2047  data: 0.0001  max mem: 2990
Epoch: [0]  [  70/1675]  eta: 0:07:10  lr: 0.000360  loss: 1.3637 (1.9185)  loss_classifier: 0.1505 (0.3695)  loss_box_reg: 0.0245 (0.0131)  loss_mask: 0.5380 (0.8522)  loss_objectness: 0.6197 (0.6622)  loss_rpn_box_reg: 0.0174 (0.0215)  time: 0.2120  data: 0.0001  max mem: 3228
Epoch: [0]  [  80/1675]  eta: 0:06:49  lr: 0.000410  loss: 1.2456 (1.8295)  loss_classifier: 0.0936 (0.3349)  loss_box_reg: 0.0433 (0.0199)  loss_mask: 0.5098 (0.8087)  loss_objectness: 0.5560 (0.6445)  loss_rpn_box_reg: 0.0164 (0.0215)  time: 0.2031  data: 0.0001  max mem: 3228
Epoch: [0]  [  90/1675]  eta: 0:06:45  lr: 0.000460  loss: 1.0765 (1.7418)  loss_classifier: 0.0855 (0.3081)  loss_box_reg: 0.0667 (0.0278)  loss_mask: 0.4301 (0.7669)  loss_objectness: 0.4724 (0.6179)  loss_rpn_box_reg: 0.0156 (0.0211)  time: 0.2117  data: 0.0001  max mem: 3228
Epoch: [0]  [ 100/1675]  eta: 0:06:41  lr: 0.000509  loss: 0.9018 (1.6521)  loss_classifier: 0.0778 (0.2855)  loss_box_reg: 0.0816 (0.0338)  loss_mask: 0.4092 (0.7326)  loss_objectness: 0.3143 (0.5800)  loss_rpn_box_reg: 0.0141 (0.0202)  time: 0.2477  data: 0.0001  max mem: 3228

The log above shows that the demo is now training.
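
To track progress without reading the raw log, each progress line can be parsed into numbers. The sketch below is written against the line format shown above; `parse_progress` is a hypothetical helper. The two loss values are read as "current (average)". In torchvision's reference detection utilities that pair is a windowed median followed by a global average, and this demo appears to reuse that logger, but that is an assumption.

```python
import re

LOG_PATTERN = re.compile(
    r"Epoch: \[(?P<epoch>\d+)\]\s+\[\s*(?P<step>\d+)/(?P<total>\d+)\]"
    r".*?\bloss: (?P<current>[\d.]+) \((?P<average>[\d.]+)\)"
)


def parse_progress(line):
    """Return epoch, step, total, and the two loss values, or None if the line does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return {
        "epoch": int(match["epoch"]),
        "step": int(match["step"]),
        "total": int(match["total"]),
        "loss": float(match["current"]),
        "loss_avg": float(match["average"]),
    }


if __name__ == "__main__":
    line = ("Epoch: [0]  [ 100/1675]  eta: 0:06:41  lr: 0.000509  "
            "loss: 0.9018 (1.6521)  loss_classifier: 0.0778 (0.2855)")
    print(parse_progress(line))
```

Only the total `loss:` field is captured; the component losses such as `loss_classifier` are skipped because the pattern requires the exact name `loss`.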

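4.3 The Pain Point: docker exec Is the Only Way In

Analysis of the key pain point
When developing on the base image, the biggest inconvenience is that the container can only be entered through docker exec. This means:

  1. Remote access is awkward: we cannot SSH straight into the container and must log in to the host first

  2. File transfer is cumbersome: anything outside a mounted volume has to be moved with docker cp

  3. Multiple sessions are clumsy: every additional terminal needs another host login followed by another docker exec

  4. The development experience suffers: familiar SSH client tools cannot connect to the container directly

This is exactly why we build SSH and similar services into the final image.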

4.4 The Complete Dockerfile, Built Up Incrementally

Create basic.py in the root directory of the demo code:

root@gpu-3090-vm09:/home/mask-rcnn/code/mask-rcnn/mask-rcnn# ll
total 96
drwxr-xr-x 5 root root  4096 Oct 14 06:17 ./
drwxr-xr-x 3 root root  4096 Oct 14 06:02 ../
-rw-r--r-- 1 root root 36690 Oct 14 06:02 1.jpg
-rw-r--r-- 1 root root   492 Oct 14 06:09 basic.py
drwxr-xr-x 3 root root  4096 Oct 14 06:02 detection/
drwxr-xr-x 5 root root  4096 Oct 14 06:02 PennFudanPed/
-rw-r--r-- 1 root root  2526 Oct 14 06:02 predict.py
-rw-r--r-- 1 root root  2301 Oct 14 06:02 predict_vision.py
-rw-r--r-- 1 root root  4724 Oct 14 06:02 README.md
-rw-r--r-- 1 root root   108 Oct 14 06:10 requirements.txt
-rw-r--r-- 1 root root    69 Oct 14 06:12 run.sh
drwxr-xr-x 2 root root  4096 Oct 14 06:02 src/
-rw-r--r-- 1 root root  4036 Oct 14 06:02 test_io.py
-rw-r--r-- 1 root root  7387 Oct 14 06:02 train.py

Create a Dockerfile in the root directory of the demo code:

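# Same official base image used for the interactive exploration above
FROM pytorch/pytorch:2.1.0-cuda11.8-cudnn8-devel

# Suppress interactive prompts (such as tzdata) during apt-get install,
# and make Python flush output immediately so logs appear in real time
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1

# SSH server, an editor, and the system libraries OpenCV needs (libGL and others)
RUN apt-get update && apt-get install -y \
    openssh-server \
    vim \
    libgl1-mesa-glx \
    libglib2.0-0 \
    libsm6 \
    libxrender-dev \
    libxext6 \
    && rm -rf /var/lib/apt/lists/*

# Enable root login over SSH with a password.
# NOTE: the hard-coded password is for demonstration only; change it or switch to key-based login on shared machines.
RUN mkdir -p /var/run/sshd \
    && echo 'root:deeplearning' | chpasswd \
    && sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config \
    && sed -i 's/#PasswordAuthentication yes/PasswordAuthentication yes/' /etc/ssh/sshd_config

# Copy the demo code into the image
COPY . /workspace

# Install Python dependencies, then run basic.py as a smoke test.
# GPUs are normally not visible during docker build, so the script will likely report CUDA as unavailable at this step.
RUN pip install --no-cache-dir -r /workspace/requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple \
    && echo "=== Running basic.py to check the environment ===" \
    && python /workspace/basic.py \
    && echo "=== Environment check finished ==="

WORKDIR /workspace

EXPOSE 22

# Run sshd in the foreground as the container's main process
CMD ["/usr/sbin/sshd", "-D"]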

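Build the final image, which includes the SSH service:

docker build -t my-dl-environment .

# Start the container (remove the earlier dl-workspace container first, because the name is reused)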
docker run -d --gpus all --name dl-workspace  -p 2222:22   my-dl-environment:latest 
65ecdea83e97d67890cdb727af80563c356448aab9c5804fbbc1fe2ecdc9d075
root@gpu-3090-vm09:/home/mask-rcnn/code/mask-rcnn/mask-rcnn# docker ps -a
CONTAINER ID   IMAGE                                               COMMAND                  CREATED         STATUS                     PORTS                                     NAMES
65ecdea83e97   my-dl-environment:latest                            "/opt/nvidia/nvidia_…"   2 seconds ago   Up 2 seconds               0.0.0.0:2222->22/tcp, [::]:2222->22/tcp   dl-workspace
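
Before opening an SSH client, it can be useful to confirm that the published port actually accepts connections. This is a minimal sketch with the standard `socket` module; `port_is_open` is a hypothetical helper, and it only tests TCP reachability, not whether the SSH login itself will succeed.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 2222 is the host port published by `-p 2222:22` above; 127.0.0.1 assumes
    # the check runs on the Docker host itself.
    print(port_is_open("127.0.0.1", 2222))
```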

Connect to the container over SSH:

C:\Users\39162>ssh root@192.168.100.120 -p 2222
root@192.168.100.120's password:
Welcome to Ubuntu 20.04.6 LTS (GNU/Linux 5.15.0-152-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

This system has been minimized by removing packages and content that are
not required on a system that users do not log into.

To restore this content, you can run the 'unminimize' command.
Last login: Tue Oct 14 07:07:33 2025 from 192.168.100.123
root@65ecdea83e97:~# cd /workspace/
root@65ecdea83e97:/workspace# ll
total 100
drwxr-xr-x 1 root root  4096 Oct 14 07:05 ./
drwxr-xr-x 1 root root  4096 Oct 14 07:04 ../
-rw-r--r-- 1 root root 36690 Oct 14 06:02 1.jpg
-rw-r--r-- 1 root root   898 Oct 14 07:01 Dockerfile
drwxr-xr-x 5 root root  4096 Oct 14 06:02 PennFudanPed/
-rw-r--r-- 1 root root  4724 Oct 14 06:02 README.md
-rw-r--r-- 1 root root   492 Oct 14 06:55 basic.py
drwxr-xr-x 3 root root  4096 Oct 14 06:02 detection/
-rw-r--r-- 1 root root  2526 Oct 14 06:02 predict.py
-rw-r--r-- 1 root root  2301 Oct 14 06:02 predict_vision.py
-rw-r--r-- 1 root root   108 Oct 14 06:10 requirements.txt
-rw-r--r-- 1 root root    69 Oct 14 06:12 run.sh
drwxr-xr-x 2 root root  4096 Oct 14 06:02 src/
-rw-r--r-- 1 root root  4036 Oct 14 06:02 test_io.py
-rw-r--r-- 1 root root  7387 Oct 14 06:02 train.py
root@65ecdea83e97:/workspace#

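Conclusion: Embracing a Containerized Future for Deep Learning

With Docker containerization, we have addressed the core pain points of deep learning training: environment dependencies, version management, and resource scheduling. From single-machine training to distributed clusters, and from local development to production deployment, Docker gives us a consistent, reliable, and efficient training environment.

Key takeaways

  • ✅ Environment consistency: no more "it works on my machine"

  • ✅ Fast onboarding: new members are up and running in minutes

  • ✅ Resource isolation: multiple projects run side by side without conflicts

  • ✅ Cluster scheduling: make full use of GPU resources

  • ✅ Cost optimization: higher hardware utilization

As cloud-native and container technologies continue to mature, Docker-based deep learning training is on track to become an industry standard. Mastering these skills not only improves individual productivity but also makes team collaboration and project deployment markedly smoother.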


Appendix: Command Quick Reference

# Image management
docker build -t my-training .                    # Build an image
docker images                                    # List images
docker push myregistry.com/training:v1.0         # Push an image

# Running containers
docker run --gpus all training-image                       # Use GPUs
docker run -v /host/path:/container/path training-image    # Mount a volume
docker run -e ENV_VAR=value training-image                 # Set an environment variable

# Cluster management
kubectl apply -f training-job.yaml              # Deploy a Kubernetes job
kubectl get pods -l job-name=training           # Check job status

# Monitoring and debugging
docker logs container_name                      # View container logs
docker exec -it container_name bash             # Enter a container
nvidia-smi                                      # Check GPU status