The code used to run without errors; when I ran it again a few months later, it failed.

Error 1

RuntimeError: strides() called on an undefined Tensor

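The error is raised at script_model.save(os.path.join(model_dir, 'init.zip')), that is, while saving init.zip. The size of the init.zip file that gets written also looks wrong.

Solution

This was not actually solved. I commented out the related code and the error went away; the project does not need the exported file anyway.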

-        script_model = torch.jit.script(model)
-        script_model.save(os.path.join(model_dir, 'init.zip'))
+        #script_model = torch.jit.script(model)
+        #script_model.save(os.path.join(model_dir, 'init.zip'))
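
If the TorchScript export is optional but still worth keeping when it works, one alternative to commenting it out is to guard it. The sketch below is only an illustration, not the project's code: `save_optional_artifact` is a hypothetical helper name, and it assumes the failure surfaces as a `RuntimeError`, as in the message above. It also deletes any partially written file, since the init.zip that was saved had the wrong size.

```python
import logging
import os


def save_optional_artifact(save_fn, path):
    """Run save_fn(path); return True on success, False on RuntimeError.

    Hypothetical helper: on failure it logs a warning and deletes any
    partially written file so a truncated artifact is not left behind.
    """
    try:
        save_fn(path)
        return True
    except RuntimeError as exc:
        logging.warning("Skipping optional export %s: %s", path, exc)
        if os.path.exists(path):
            os.remove(path)
        return False


# Intended use in the training script (assumes torch, model and model_dir exist):
# save_optional_artifact(
#     lambda p: torch.jit.script(model).save(p),
#     os.path.join(model_dir, "init.zip"),
# )
```

With this in place, training continues when the export fails, and the warning in the log records that init.zip was skipped.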

Error 2

torch.distributed.elastic.multiprocessing.errors.ChildFailedError

Detailed error:

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1586 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1585) of binary: /home/work/miniconda3/bin/python
Traceback (most recent call last):
  File "/home/work/miniconda3/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.13.0', 'console_scripts', 'torchrun')())
  File "/home/work/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/work/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/work//miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/work/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/work//miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
wespeaker/bin/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-06_10:13:49
  host      : tjtx178-33-25.58os.org
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1585)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
========================================================
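
The report above shows `error_file: <N/A>` and no traceback, which makes the real cause hard to see. The linked page describes the `record` decorator from `torch.distributed.elastic.multiprocessing.errors`. Below is a minimal sketch of applying it to an entry point. The `main` body is a placeholder, and the `ImportError` fallback is there only so the snippet also runs where torch is not installed.

```python
try:
    from torch.distributed.elastic.multiprocessing.errors import record
except ImportError:
    # Fallback so this sketch still runs without torch installed.
    def record(fn):
        return fn


@record
def main():
    # Placeholder for the real training entry point.
    # If it raises, torchrun can include the traceback in its failure report.
    return 0


if __name__ == "__main__":
    main()
```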

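Cause (this is only one possible cause, not the only one; the fixes below are worth trying):

Not enough CPU memory (host RAM).

Solution

  1. Recreate the Docker container with shm set to a larger value, for example 150G
  2. Reduce batch_size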
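
Before recreating the container, it can help to check how much shared memory it actually has. With Docker, the limit is set by the `--shm-size` option of `docker run` (for example `--shm-size=150g`). The sketch below uses only the standard library. `shm_free_gib` is a hypothetical helper name, and `/dev/shm` is assumed to be the shared-memory mount point, as is usual on Linux.

```python
import os
import shutil


def shm_free_gib(path="/dev/shm"):
    """Return free space at `path` in GiB, or None if the path does not exist."""
    if not os.path.exists(path):
        return None
    return shutil.disk_usage(path).free / 2**30


if __name__ == "__main__":
    free = shm_free_gib()
    if free is None:
        print("/dev/shm not found on this system")
    else:
        print(f"/dev/shm free: {free:.1f} GiB")
```

Running this inside the container before and after the change gives a quick check that the new shm size took effect.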