BERT fails with not-close tensor when tested on GPUs #139

Open
@kwen2501

Reproducer:

python ./test/local_test_forward_hf_bert.py --cuda 1

Output:

REPLICATE config: 1 -> MultiUseParameterConfig.REPLICATE
/home/ec2-user/actions-runner/_work/PiPPy/PiPPy/build_binary_3.8/lib/python3.8/site-packages/transformers/activations.py:56: UserWarning: Defining your `__torch_function__` as a plain method is deprecated and will be an error in future, please define it as a classmethod. (Triggered internally at  ../torch/csrc/utils/python_arg_parser.cpp:289.)
  return self.act(input)
Using schedule: 1F1B
Instantiating BERT Pipeline
...
Traceback (most recent call last):
  File "test/local_test_forward_hf_bert.py", line 155, in <module>
    mp.spawn(run_worker, args=(args.world_size, args,), nprocs=args.world_size, join=True)
  File "/home/ec2-user/actions-runner/_work/PiPPy/PiPPy/build_binary_3.8/lib64/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/ec2-user/actions-runner/_work/PiPPy/PiPPy/build_binary_3.8/lib64/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/ec2-user/actions-runner/_work/PiPPy/PiPPy/build_binary_3.8/lib64/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/ec2-user/actions-runner/_work/PiPPy/PiPPy/build_binary_3.8/lib64/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/ec2-user/actions-runner/_work/PiPPy/PiPPy/test/local_test_forward_hf_bert.py", line 139, in run_worker
    run_master(args)
  File "/home/ec2-user/actions-runner/_work/PiPPy/PiPPy/test/local_test_forward_hf_bert.py", line 98, in run_master
    torch.testing.assert_close(out['last_hidden_state'], ref_out['last_hidden_state'])
  File "/home/ec2-user/actions-runner/_work/PiPPy/PiPPy/build_binary_3.8/lib64/python3.8/site-packages/torch/testing/_comparison.py", line 1304, in assert_close
    assert_equal(
  File "/home/ec2-user/actions-runner/_work/PiPPy/PiPPy/build_binary_3.8/lib64/python3.8/site-packages/torch/testing/_comparison.py", line 1074, in assert_equal
    raise error_metas[0].to_error()
AssertionError: Tensor-likes are not close!

Mismatched elements: 393212 / 491520 (80.0%)
Greatest absolute difference: 4.328115940093994 at index (9, 12, 223) (up to 1e-05 allowed)
Greatest relative difference: 1.0 at index (4, 0, 0) (up to 1.3e-06 allowed)
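
For readers interpreting the numbers above: `torch.testing.assert_close` treats an element as close when `|actual - expected| <= atol + rtol * |expected|`, and the "up to 1e-05 allowed" and "up to 1.3e-06 allowed" values in the message match its default `atol` and `rtol` for float32. The following is a minimal pure-Python sketch of that rule on flat lists, not the PyTorch implementation. The helper name `summarize_mismatch` and the sample values are hypothetical, and skipping zero-valued expected elements when computing the relative difference is a simplification.

```python
# Sketch of the element-wise closeness rule used by torch.testing.assert_close:
#   |actual - expected| <= atol + rtol * |expected|
# Pure Python on flat lists for illustration only; the helper name is hypothetical.

def summarize_mismatch(actual, expected, rtol=1.3e-6, atol=1e-5):
    """Return (mismatched count, greatest abs diff, greatest rel diff)."""
    if len(actual) != len(expected):
        raise ValueError("inputs must have the same length")
    mismatched = 0
    max_abs = 0.0
    max_rel = 0.0
    for a, e in zip(actual, expected):
        abs_diff = abs(a - e)
        # An element mismatches when it exceeds the combined tolerance.
        if abs_diff > atol + rtol * abs(e):
            mismatched += 1
        max_abs = max(max_abs, abs_diff)
        # Simplification: ignore expected == 0 to avoid division by zero.
        if e != 0:
            max_rel = max(max_rel, abs_diff / abs(e))
    return mismatched, max_abs, max_rel


# Made-up sample data: the third element is zero where 0.5 was expected.
expected = [1.0, -2.0, 0.5, 4.0]
actual = [1.0, -2.0, 0.0, 4.0000001]
mismatched, max_abs, max_rel = summarize_mismatch(actual, expected)
print(f"Mismatched elements: {mismatched} / {len(expected)}")
print(f"Greatest absolute difference: {max_abs}")
print(f"Greatest relative difference: {max_rel}")
```

In this sketch the zeroed element yields a relative difference of exactly 1.0, while the tiny perturbation on the last element stays within tolerance. A greatest absolute difference of about 4.33 with 80% of elements mismatched, as reported above, is far larger than floating-point noise, so loosening `rtol` or `atol` would likely only mask the failure. Whether the relative difference of 1.0 at index (4, 0, 0) means that output element is zero is an assumption the log does not confirm.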

Labels: bug, high-pri
