Error 304: OS call failed or operation not supported on this OS (Triggered internally at …/c10/cuda/

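Summary: Running a Docker container on CentOS 7.9 fails with a CUDA initialization error (Error 304), mainly because the NVIDIA UVM kernel module is not loaded or its device node is missing. The fix has three parts: 1) manually load the nvidia-uvm module on the host and create the device node; 2) set Docker's default runtime to nvidia; 3) restart the container in privileged mode. Verification shows that this approach resolves torch.cuda.is_available() returning False.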

dream_home8407 · Published 2026-02-27 15:00:42

I ran the command above in the migrated environment and the problem still occurs. Please analyze the cause carefully and explain how to fix it.

[root@node-2 ~]# docker run --gpus all -itd \
  --name yolo_train_new \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -v /data/:/data \
  -p 8080:8080 \
  yolo_train:v1
d66957373960fde35613eb64277281a5f58d5e88edaa7d5127b1dd52037f1f64

[root@node-2 ~]# docker exec -it d6695 bash
root@d66957373960:/ultralytics# python
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 304: OS call failed or operation not supported on this OS (Triggered internally at …/c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False
>>> exit()

root@d66957373960:/ultralytics# nvidia-smi
Thu Feb 26 06:35:54 2026
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A800 80GB PCIe          Off | 00000000:1B:00.0 Off |                    0 |
| N/A   52C    P0              73W / 300W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

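The Error 304: OS call failed error you are seeing on CentOS 7.9 is usually caused by the NVIDIA UVM (Unified Memory) kernel module not being loaded properly on the host, or by its device node being missing. Even when nvidia-smi can see the GPU, CUDA initialization (that is, torch.cuda.is_available()) fails if the UVM module is not loaded.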

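Root Cause Analysis

Error 304 (cudaErrorOperatingSystem in the CUDA runtime) means that an operating system call failed. On CentOS 7, the most likely causes are:

The nvidia-uvm module is not loaded on the host: CUDA 11+ relies heavily on this module for memory management.
The /dev/nvidia-uvm device node is missing: the container can see the GPU but cannot complete initialization.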
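
To tell these two causes apart from "the GPU is not exposed at all", one option is to check the device nodes directly, on the host or inside the container. The following is a minimal sketch, not part of the original answer; the list of expected nodes is an assumption for a single-GPU machine (index 0), and the helper names are hypothetical.

```python
import os
import stat

# Device nodes the CUDA runtime typically opens. This list is an assumption
# for illustration (GPU index 0 only), not an exhaustive inventory.
EXPECTED_NODES = ["/dev/nvidiactl", "/dev/nvidia0", "/dev/nvidia-uvm"]


def is_char_device(path):
    """Return True if path exists and is a character device."""
    try:
        return stat.S_ISCHR(os.stat(path).st_mode)
    except OSError:
        # Missing path, permission problem, etc. all count as "not usable".
        return False


def missing_nodes(paths=EXPECTED_NODES):
    """Return the subset of paths that are not character devices."""
    return [p for p in paths if not is_char_device(p)]


if __name__ == "__main__":
    missing = missing_nodes()
    if missing:
        print("Missing device nodes:", ", ".join(missing))
    else:
        print("All expected NVIDIA device nodes are present")
```

If /dev/nvidia0 and /dev/nvidiactl are present but /dev/nvidia-uvm is reported missing, that matches the situation described above: nvidia-smi works while CUDA initialization fails.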

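Solution Steps

Run the following commands in order on your CentOS 7.9 host (node-2):

1. Force-load the NVIDIA UVM module

Run the following commands to load the module manually and create the device node: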

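# Check whether the module is loaded
lsmod | grep nvidia_uvm

# If there is no output, load it manually
sudo modprobe nvidia-uvm

# Create the device node by hand if /dev/nvidia-uvm does not exist.
# The major number is assigned dynamically, so read it from /proc/devices.
if [ ! -c /dev/nvidia-uvm ]; then
  major=$(awk '$2 == "nvidia-uvm" {print $1}' /proc/devices)
  sudo mknod -m 666 /dev/nvidia-uvm c "$major" 0
fi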
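
The mknod call depends on looking up the driver's major number in /proc/devices, where each data line has the form "<major> <name>", with the number right-aligned. As an illustration of that lookup, here is a small Python sketch. The function name find_major and the sample text are hypothetical, and real major numbers differ between hosts.

```python
def find_major(proc_devices_text, name="nvidia-uvm"):
    """Return the major number registered under `name`, or None."""
    for line in proc_devices_text.splitlines():
        fields = line.split()
        # Data lines look like "<major> <name>"; section headers such as
        # "Character devices:" fail the isdigit() check and are skipped.
        if len(fields) == 2 and fields[0].isdigit() and fields[1] == name:
            return int(fields[0])
    return None


# Hypothetical excerpt; the numbers are made up for the example.
SAMPLE = """Character devices:
  1 mem
 10 misc
195 nvidia-frontend
237 nvidia-uvm

Block devices:
  8 sd
"""

if __name__ == "__main__":
    print(find_major(SAMPLE))
```

On a real host the input would be the contents of /proc/devices. A result of None means the nvidia-uvm driver has not registered a device, so the modprobe step has to succeed before mknod can work.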

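2. Set the Docker default runtime (key step)

Make sure /etc/docker/daemon.json sets default-runtime. With this in place, every container starts under the NVIDIA runtime and gets the NVIDIA driver environment by default.

Edit /etc/docker/daemon.json: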

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Then restart Docker:

sudo systemctl daemon-reload
sudo systemctl restart docker
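
A syntax error in daemon.json prevents the Docker daemon from starting, so it can be worth sanity-checking the file from this step, ideally before the restart. The sketch below is one way to do that under the assumption that the file should match the layout shown in this step; check_daemon_config is a hypothetical helper, not a Docker tool.

```python
import json


def check_daemon_config(text):
    """Return a list of problems found in a daemon.json string."""
    try:
        cfg = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(cfg, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    default = cfg.get("default-runtime")
    if default != "nvidia":
        problems.append(f"default-runtime is {default!r}, expected 'nvidia'")

    runtimes = cfg.get("runtimes", {})
    if not isinstance(runtimes, dict) or "nvidia" not in runtimes:
        problems.append("runtimes.nvidia is not defined")
    elif not isinstance(runtimes["nvidia"], dict) or not runtimes["nvidia"].get("path"):
        problems.append("runtimes.nvidia.path is missing or empty")
    return problems


if __name__ == "__main__":
    # Path taken from this step; reading it may require root.
    with open("/etc/docker/daemon.json") as fh:
        issues = check_daemon_config(fh.read())
    print("\n".join(issues) if issues else "daemon.json looks consistent")
```

An empty result only says the file is well-formed and names the nvidia runtime. It does not confirm that nvidia-container-runtime is actually installed on the host.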

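3. Restart the container in --privileged mode

On CentOS 7, limitations of cgroup v1 sometimes mean that a container needs elevated privileges to initialize CUDA. Start the container with the following command: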

docker run --gpus all --privileged -itd \
  --name yolo_train_final \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -v /data/:/data \
  -p 8081:8081 \
  yolo_train:v1

Verification

Inside the container, run the following Python code. If it prints True, the problem is fully resolved:

import torch
print(f"CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Device Name: {torch.cuda.get_device_name(0)}")
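
For a slightly stronger check than is_available(), a tiny computation can be run on the GPU so that a real allocation and kernel launch take place. This is a sketch under the assumption that a single visible GPU is enough for the test; cuda_smoke_test is a hypothetical helper name, and the function degrades gracefully when torch or CUDA is unavailable.

```python
def cuda_smoke_test():
    """Return a dict describing how far CUDA initialization gets."""
    report = {"torch_installed": False, "available": False, "kernel_ok": False}
    try:
        import torch
    except ImportError:
        return report
    report["torch_installed"] = True

    report["available"] = bool(torch.cuda.is_available())
    if report["available"]:
        try:
            # Allocating on the GPU and reducing forces an actual kernel launch.
            x = torch.ones(8, device="cuda")
            report["kernel_ok"] = float(x.sum().item()) == 8.0
        except RuntimeError as exc:
            # Keep the message so the failure can be inspected afterwards.
            report["error"] = str(exc)
    return report


if __name__ == "__main__":
    print(cuda_smoke_test())
```

If available is True but kernel_ok stays False, the error entry holds the runtime message, which is usually more specific than the warning printed by is_available().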

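Why did the earlier setenforce 0 have no effect?
Because the failure behind Error 304 usually takes precedence over any SELinux blocking. If the kernel module is not loaded at all, or the device file does not exist under /dev, CUDA still cannot reach the GPU through system calls even with SELinux disabled. Running modprobe nvidia-uvm is the decisive fix for this problem.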