Deep model training errors
The same code used to run without errors; when I ran it again a few months later, it failed.
Error 1
RuntimeError: strides() called on an undefined Tensor
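The error is raised at script_model.save(os.path.join(model_dir, 'init.zip')), that is, while init.zip is being saved. The size of the resulting init.zip file also looked wrong to me.
Solution
I did not actually fix this one. I commented out the related code, which the project does not use anyway, and the error went away.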
- script_model = torch.jit.script(model)
- script_model.save(os.path.join(model_dir, 'init.zip'))
+ #script_model = torch.jit.script(model)
+ #script_model.save(os.path.join(model_dir, 'init.zip'))
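If the TorchScript export is optional but worth keeping when it works, one alternative to commenting it out is to wrap it so that a failure is logged instead of aborting training. This is only a minimal sketch: `run_optional` is a hypothetical helper (not part of wespeaker or PyTorch), and the usage in the comments assumes `model` and `model_dir` come from the training script.

```python
import logging

logger = logging.getLogger(__name__)


def run_optional(step, description):
    """Run step(); if it raises RuntimeError, log a warning and return False."""
    try:
        step()
    except RuntimeError as exc:
        logger.warning("Skipping %s: %s", description, exc)
        return False
    return True


# Hypothetical usage inside the training script:
#
# def export_init_model():
#     script_model = torch.jit.script(model)
#     script_model.save(os.path.join(model_dir, 'init.zip'))
#
# run_optional(export_init_model, "TorchScript export of init.zip")
```

Note that this only hides the failure. It does not explain why the export breaks, and a partially written init.zip may still be left on disk.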
Error 2
torch.distributed.elastic.multiprocessing.errors.ChildFailedError
Full error output:
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1586 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1585) of binary: /home/work/miniconda3/bin/python
Traceback (most recent call last):
  File "/home/work/miniconda3/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.13.0', 'console_scripts', 'torchrun')())
  File "/home/work/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/work/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/work//miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/work/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/work//miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
wespeaker/bin/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-06_10:13:49
  host      : tjtx178-33-25.58os.org
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1585)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
========================================================
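The summary above shows error_file: <N/A> and no traceback, so the real exception from the worker is not visible. The PyTorch page linked in the log describes the `record` decorator for the training entry point, which lets torchrun report the worker's traceback. A minimal sketch, assuming PyTorch is installed; the `main` body here is a placeholder rather than the actual wespeaker/bin/train.py code, and the `ImportError` fallback exists only so the sketch can be imported without PyTorch.

```python
try:
    # Provided by PyTorch's elastic launcher package.
    from torch.distributed.elastic.multiprocessing.errors import record
except ImportError:
    # Fallback (assumption for this sketch only): a no-op decorator.
    def record(fn):
        return fn


@record
def main():
    # Placeholder for the real training entry point.
    return 0


if __name__ == "__main__":
    main()
```

With the decorator in place, a failing worker's exception should show up in the torchrun failure summary instead of `<N/A>`, which makes it easier to tell a memory problem apart from an ordinary bug.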
Cause (this is only one possible cause, not the only one; the fixes below are worth trying):
Not enough CPU (host) memory.
Solution
- Recreate the Docker container with a larger shm (shared memory) size, for example 150G.
- Reduce batch_size.
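Before recreating the container, it can help to check how much shared memory the current one actually has. Below is a minimal sketch using only the standard library; the helper names and the threshold passed to `has_enough_shm` are hypothetical, and `/dev/shm` is assumed to be where the container mounts its shared memory (the usual location on Linux). In Docker, this size is set with the `--shm-size` option of `docker run`.

```python
import shutil
from pathlib import Path

GIB = 1024 ** 3


def shm_usage_gib(path="/dev/shm"):
    """Return (total_gib, free_gib) for path, or None if the path does not exist."""
    if not Path(path).exists():
        return None
    usage = shutil.disk_usage(path)
    return usage.total / GIB, usage.free / GIB


def has_enough_shm(min_free_gib, path="/dev/shm"):
    """Return True if at least min_free_gib GiB are free under path."""
    stats = shm_usage_gib(path)
    if stats is None:
        return False
    _total, free = stats
    return free >= min_free_gib


if __name__ == "__main__":
    stats = shm_usage_gib()
    if stats is None:
        print("/dev/shm not found")
    else:
        print(f"/dev/shm: total {stats[0]:.1f} GiB, free {stats[1]:.1f} GiB")
```

If the reported total is small (Docker's default is 64 MB), that is consistent with the shared memory explanation above; if it is already large, reducing batch_size is the next thing to try.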