NVIDIA-SMI Shows ERR!: the GPU reports an error
Sun Jan 6 17:15:44 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:06:00.0 Off | N/A |
| 62% 78C P2 256W / 260W | 9995MiB / 10989MiB | 90% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 14277 C ...se/build/examples/openpose/openpose.bin 9985MiB |
+-----------------------------------------------------------------------------+
Sun Jan 6 17:16:19 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:06:00.0 Off | N/A |
| 52% 63C P8 35W / 260W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Sun Jan 6 17:19:18 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:06:00.0 Off | N/A |
|ERR! 55C P0 ERR! / 260W | 23MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 3518 C - 13MiB |
+-----------------------------------------------------------------------------+
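After a distributed deep-learning training run, the GPU started misbehaving. After some troubleshooting, the cause was traced to overheating: the GPU temperature got too high, and nvidia-smi began reporting ERR! for the fan and power fields.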
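The hardware is described as GTX 1080 Ti GPUs running on driver 410.78, although the nvidia-smi output above reports a GeForce RTX 208... on driver 410.79.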
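Solution:
1. Move the workstation or server with the failing GPU to a cooler location. If you do not have permission to touch the server hardware, continue with the steps below.
2. Enable persistence mode:
sudo nvidia-smi -pm 1
3. Lower the power limit (200 W here) so that the temperature under maximum power draw does not exceed 75C:
sudo nvidia-smi -pl 200
If you want to be even safer, you can set the power limit lower still.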
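To check whether the new power limit actually keeps the card under 75C, the temperature can be polled with nvidia-smi's query mode. The following is only a minimal sketch: the helper name `check_gpu_line` and the variable `TEMP_LIMIT_C` are made up for illustration, and treating any non-numeric temperature field as "unreadable" is an assumption about how a faulty reading shows up in query output.

```shell
#!/bin/sh
# Temperature ceiling taken from step 3 above.
TEMP_LIMIT_C=75

# Hypothetical helper: classify one CSV line of the form
# "index, temperature, power" against a temperature limit.
#   $1: CSV line, e.g. "0, 78, 256.00"
#   $2: temperature limit in degrees C
check_gpu_line() {
    line=$1
    limit=$2
    idx=$(printf '%s\n' "$line" | cut -d, -f1 | tr -d ' ')
    temp=$(printf '%s\n' "$line" | cut -d, -f2 | tr -d ' ')
    # Anything that is not a plain integer is reported as unreadable
    # (assumption: an error state yields a non-numeric field).
    case $temp in
        ''|*[!0-9]*)
            echo "GPU $idx: unreadable temperature ($temp)"
            return 2
            ;;
    esac
    if [ "$temp" -gt "$limit" ]; then
        echo "GPU $idx: ${temp}C exceeds ${limit}C"
        return 1
    fi
    echo "GPU $idx: ${temp}C ok"
}

# Query the real GPUs only when nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,temperature.gpu,power.draw \
               --format=csv,noheader,nounits |
    while IFS= read -r line; do
        check_gpu_line "$line" "$TEMP_LIMIT_C"
    done
fi
```

Running this while the training job is under load gives a quick indication of whether 200 W is low enough, or whether the limit should be reduced further with `sudo nvidia-smi -pl`.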
Reference:
https://forums.developer.nvidia.com/t/nvidia-smi-shows-err-on-both-fan-and-power-usage/68293/13