Description
Seen at commit 8d9770 (may also occur at earlier commits).
The failure is intermittent.
Test:
python local_test_ddp.py
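Because the failure is intermittent, it may help to rerun the test several times and count how often it fails. The following is only a minimal sketch: the helper name `count_failures` and the run count are hypothetical and not part of PiPPy. It assumes a failing run exits with a non-zero status, which is the case when the exception propagates out of the script as in the log in this report.

```python
import subprocess
import sys


def count_failures(cmd, runs=20):
    """Run `cmd` `runs` times and return how many runs exited non-zero."""
    failures = 0
    for i in range(runs):
        # Capture output so passing runs stay quiet.
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures += 1
            # Print the tail of stderr so the failing assertion is visible.
            print(f"run {i}: failed\n{result.stderr[-500:]}")
    return failures
```

For example, `count_failures([sys.executable, "local_test_ddp.py"], runs=20)` run from the `test` directory would give a rough failure rate for this issue.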
Log:
Traceback (most recent call last):
  File "/home/kw2501/PiPPy/test/local_test_ddp.py", line 245, in <module>
    mp.spawn(run_worker, args=(args.world_size, args,), nprocs=args.world_size, join=True)
  File "/home/kw2501/pytorch/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/kw2501/pytorch/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/kw2501/pytorch/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/kw2501/pytorch/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/kw2501/PiPPy/test/local_test_ddp.py", line 213, in run_worker
    run_master(args, pp_ranks_per_dp_group[rank])
  File "/home/kw2501/PiPPy/test/local_test_ddp.py", line 167, in run_master
    raise AssertionError(f'Gradients not close: {not_close_grads}')
AssertionError: Gradients not close: ['split_gm.submod_0.moved_module_mm_param', 'split_gm.submod_1.moved_module_lin_w']