sync batchnorm apex

tested on a pytorch master build, version 1.6.0a0+e180ca6
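
sbn.py itself is not included here; below is a minimal sketch of what it likely looks like, reconstructed from the traceback. The names run and init_processes are taken from the traceback; the channel count, input shape, backend choice, and one-GPU-per-rank placement are assumptions.

    import os

    import torch
    import torch.distributed as dist
    from torch.multiprocessing import Process

    import apex.parallel


    def run(rank, size):
        c = 4  # number of channels (assumed)
        # one GPU per rank is an assumption
        net = apex.parallel.SyncBatchNorm(c).cuda(rank)
        ddp_net = torch.nn.parallel.DistributedDataParallel(net, device_ids=[rank])
        a = torch.randn(8, c, 4, 4, device=f"cuda:{rank}")  # input shape assumed
        b: torch.Tensor = ddp_net(a)  # line 24 in the traceback
        print(rank, b.sum().item())


    def init_processes(rank, size, fn, backend="gloo"):
        # backend and rendezvous settings are assumptions
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = "29500"
        dist.init_process_group(backend, rank=rank, world_size=size)
        fn(rank, size)


    if __name__ == "__main__":
        size = 2
        processes = []
        for rank in range(size):
            p = Process(target=init_processes, args=(rank, size, run))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()

With apex.parallel.SyncBatchNorm, the DDP forward crashes inside all_gather: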
Process Process-2:
Traceback (most recent call last):
  File "/usr/lib64/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/lib64/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "sbn.py", line 36, in init_processes
    fn(rank, size)
  File "sbn.py", line 24, in run
    b: torch.Tensor = ddp_net(a)
  File "/home/xwang/Developer/pytorch/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xwang/Developer/pytorch/torch/nn/parallel/distributed.py", line 507, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/xwang/Developer/pytorch/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xwang/.local/lib/python3.7/site-packages/apex/parallel/optimized_sync_batchnorm.py", line 85, in forward
    return SyncBatchnormFunction.apply(input, z, self.weight, self.bias, self.running_mean, self.running_var, self.eps, self.training or not self.track_running_stats, exponential_average_factor, self.process_group, channel_last, self.fuse_relu)
  File "/home/xwang/.local/lib/python3.7/site-packages/apex/parallel/optimized_sync_batchnorm_kernel.py", line 36, in forward
    torch.distributed.all_gather(mean_l, mean, process_group)
  File "/home/xwang/Developer/pytorch/torch/distributed/distributed_c10d.py", line 1185, in all_gather
    work = _default_pg.allgather([tensor_list], [tensor])
RuntimeError: All tensor operands to scatter/gather must have the same size

The other worker (Process-1) fails with an identical traceback.

workaround: use torch.nn.SyncBatchNorm instead of apex.parallel.SyncBatchNorm

    net = torch.nn.SyncBatchNorm(c)
    # net = apex.parallel.SyncBatchNorm(c)

no crash with this change
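
If a model already contains regular BatchNorm layers, torch.nn.SyncBatchNorm.convert_sync_batchnorm can swap them in place before wrapping the model in DDP; a minimal sketch (the toy model is assumed):

    import torch

    # toy model (assumed); convert_sync_batchnorm replaces every BatchNorm*D
    # layer with torch.nn.SyncBatchNorm, preserving weights and running stats
    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 4, 3),
        torch.nn.BatchNorm2d(4),
    )
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)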