DDP + CUDA gives "Gradients not close" #165

Open
@kwen2501

Seen at commit 8d9770 (may also occur at earlier commits).
The failure is intermittent.

Test:

python local_test_ddp.py

Log:

Traceback (most recent call last):
  File "/home/kw2501/PiPPy/test/local_test_ddp.py", line 245, in <module>
    mp.spawn(run_worker, args=(args.world_size, args,), nprocs=args.world_size, join=True)
  File "/home/kw2501/pytorch/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/kw2501/pytorch/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/kw2501/pytorch/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/kw2501/pytorch/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/kw2501/PiPPy/test/local_test_ddp.py", line 213, in run_worker
    run_master(args, pp_ranks_per_dp_group[rank])
  File "/home/kw2501/PiPPy/test/local_test_ddp.py", line 167, in run_master
    raise AssertionError(f'Gradients not close: {not_close_grads}')
AssertionError: Gradients not close: ['split_gm.submod_0.moved_module_mm_param', 'split_gm.submod_1.moved_module_lin_w']
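
For readers unfamiliar with this kind of check, the assertion above reports the names of parameters whose gradients differ from a reference by more than some tolerance. The source of local_test_ddp.py is not shown in this issue, so the following is only a minimal sketch of such a comparison in plain Python, not the test's actual code. The function name `find_not_close`, the tolerance values, and the sample gradient values are all hypothetical.

```python
import math

def find_not_close(grads, ref_grads, rel_tol=1e-5, abs_tol=1e-8):
    """Return the names whose gradient values differ from the reference.

    grads and ref_grads map a parameter name to a flat list of floats.
    The tolerances are illustrative assumptions, not the test's values.
    """
    not_close = []
    for name, ref in ref_grads.items():
        got = grads[name]
        same_len = len(got) == len(ref)
        # Element-wise closeness check against the reference gradient.
        if not same_len or not all(
            math.isclose(g, r, rel_tol=rel_tol, abs_tol=abs_tol)
            for g, r in zip(got, ref)
        ):
            not_close.append(name)
    return not_close

# Hypothetical values: one gradient matches, the other is slightly off.
ref = {"lin_w": [0.5, -1.25], "mm_param": [2.0, 3.0]}
got = {"lin_w": [0.5, -1.25], "mm_param": [2.0, 3.001]}
print(find_not_close(got, ref))
```

With an intermittent failure like this one, logging the largest absolute difference per parameter alongside the name may help distinguish a tolerance that is slightly too tight from a real synchronization problem.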

Labels: bug, high-pri
