Port TH operators to Aten (umbrella issue) #24507

Closed
VitalyFedyunin opened this issue Aug 16, 2019 · 7 comments
Labels
better-engineering: Relatively self-contained tasks for better engineering contributors
module: porting: Issues related to porting TH/THNN legacy to ATen native
triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@VitalyFedyunin (Contributor) commented Aug 16, 2019

Porting guide: https://github.com/pytorch/pytorch/wiki/TH-to-ATen-porting-guide

Example PR with porting of the adaptive_avg_pool2d: #14714

Example PR with porting of the point-wise operator addcmul: #22874
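
As a reminder of the semantics such a pointwise port has to preserve, here is a minimal user-level sketch (plain Python calls, not the ATen kernel itself): `addcmul` computes `input + value * tensor1 * tensor2` elementwise.

```python
import torch

input = torch.randn(3)
t1 = torch.randn(3)
t2 = torch.randn(3)

# addcmul(input, t1, t2, value=v) == input + v * t1 * t2, elementwise
out = torch.addcmul(input, t1, t2, value=0.5)
assert torch.allclose(out, input + 0.5 * t1 * t2)
```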

How to use TensorIterator (if needed): https://github.com/pytorch/pytorch/wiki/How-to-use-TensorIterator

Notes:

  • While porting operators, please include [namedtensor ci] in the commit message; it will trigger the named tensor propagation tests.

Issues were automatically generated based on the legacy:: dispatch rules in native_functions.yaml and the operator cnames in Declarations.cwrap.

CUDA Ops

CPU Ops

Pointwise bit-ops

Completed (2019-11-20)

VitalyFedyunin added the module: operators, triaged, module: porting, and better-engineering labels on Aug 16, 2019
facebook-github-bot pushed a commit that referenced this issue May 12, 2020
Summary:
Fixed #24538
Related #24507
Pull Request resolved: #37991

Differential Revision: D21531741

Pulled By: VitalyFedyunin

fbshipit-source-id: c762cc80416d7fffbb1769c6cc5e0914ceaa8e2d
facebook-github-bot pushed a commit that referenced this issue May 14, 2020
Summary:
Fixed #24559
Reference #24507
Pull Request resolved: #38373

Differential Revision: D21549626

Pulled By: ezyang

fbshipit-source-id: 84c2cf58b071df3afc312ae0aef3b5ed6c014cc7
facebook-github-bot pushed a commit that referenced this issue May 26, 2020
Summary:
References: #24521 #24522 #24547 #24548 #24507

Depends on #36308

Changes related to this PR are only in the following files:
aten/src/ATen/Declarations.cwrap
aten/src/ATen/native/cuda/ReduceOpsKernel.cu
aten/src/ATen/native/native_functions.yaml
aten/src/THC/generic/THCTensorMathScan.cu
aten/src/THC/generic/THCTensorMathScan.h

Please review, VitalyFedyunin. Thanks.
Pull Request resolved: #36458

Differential Revision: D21718384

Pulled By: ngimel

fbshipit-source-id: 5af15164050c77be164397abd659a48c9ded2b29
facebook-github-bot pushed a commit that referenced this issue Sep 25, 2020
Summary:
Related #24507
Fixes #24666

This PR modernizes the CPU implementation of the vector outer product.
The existing TH implementation behind `torch.addr` is migrated to `aten`; `torch.ger` calls into the `addr` functions to compute the outer product.
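
For context, a minimal user-level sketch (plain Python, not the migrated kernel) of the relationship described above: `ger` is the plain outer product, and it can be expressed through `addr`, which adds a scaled outer product to a matrix.

```python
import torch

x = torch.randn(3)
y = torch.randn(4)

# ger (the older name for torch.outer) computes the outer product of x and y
outer = torch.ger(x, y)

# addr(M, x, y) computes beta * M + alpha * (outer product of x and y),
# so with a zero matrix it reduces to the plain outer product
via_addr = torch.addr(torch.zeros(3, 4), x, y)
assert torch.allclose(outer, via_addr)
```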

Pull Request resolved: #44364

Reviewed By: ezyang

Differential Revision: D23866733

Pulled By: mruberry

fbshipit-source-id: 5159ea22f0e3c991123fe7c19cc9beb6ad00301e
@anjali411 (Contributor) commented:

I added it, but given that it was missing, I'm wondering if there are other ops we are missing.

Mmm, just created issues for masked_scatter and masked_fill, which were also missing from the list.

facebook-github-bot pushed a commit that referenced this issue Jan 22, 2021
Summary:
Fixes #49541

Reference: #24507

Pull Request resolved: #49732

Reviewed By: ejguan

Differential Revision: D25991438

Pulled By: ngimel

fbshipit-source-id: a43bd0bfe043d8e32a6cadbbf736a0eaa697e7ec
facebook-github-bot pushed a commit that referenced this issue Apr 3, 2021
Summary:
Fixes #24731 #24673 #24597 #24526 #46507
Related #24507

Pull Request resolved: #52043

Reviewed By: mruberry

Differential Revision: D27468266

Pulled By: ngimel

fbshipit-source-id: 35a3229c2a706da9bad4ccd0070161831e5476ba
facebook-github-bot pushed a commit that referenced this issue Jun 17, 2021
Summary:
Fixes #24618
Related to #24507
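
For context, a minimal user-level sketch of the RReLU semantics the ported kernel implements (plain Python, not the CUDA code): in eval mode negative inputs are scaled by the fixed slope `(lower + upper) / 2`, while in training mode the slope is sampled uniformly per element.

```python
import torch
import torch.nn as nn

m = nn.RReLU(lower=1/8, upper=1/3).eval()
x = torch.randn(4)

# in eval mode RReLU applies the fixed negative slope (lower + upper) / 2
slope = (1/8 + 1/3) / 2
expected = torch.where(x >= 0, x, x * slope)
assert torch.allclose(m(x), expected)
```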

<details><summary>Benchmark script:</summary>

```py
import torch
import torch.nn as nn
import time

torch.manual_seed(0)
def _time():
    torch.cuda.synchronize()
    return time.time()

device = "cuda"
m = nn.RReLU().cuda()

for n in [100, 10_000, 100_000]:
    fwd_t = 0
    bwd_t = 0
    input = torch.randn(128, n, device=device)
    grad_output = torch.ones(128, n, device=device)
    for i in range(10000):
        t1 = _time()
        output = m(input)
        t2 = _time()
        fwd_t = fwd_t + (t2 - t1)
    fwd_avg = fwd_t / 10000 * 1000
    print(f"input size(128, {n}) forward time is {fwd_avg:.2f} (ms)")
```

</details>

### Results from benchmark:

#### This PR

```
input size(128, 100) forward time is 0.01 (ms)
input size(128, 10000) forward time is 0.06 (ms)
input size(128, 100000) forward time is 0.54 (ms)
```

#### On master

```
input size(128, 100) forward time is 0.01 (ms)
input size(128, 10000) forward time is 0.08 (ms)
input size(128, 100000) forward time is 0.66 (ms)
```

Pull Request resolved: #57864

Reviewed By: H-Huang

Differential Revision: D29177169

Pulled By: ngimel

fbshipit-source-id: 4572133db06f143d27e70a91ade977ea962c8f77
facebook-github-bot pushed a commit that referenced this issue Jun 22, 2021
Summary:
Fixes #24609
Aten Umbrella issue #24507
Related to #59765
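
For context on what the ported backward has to match, here is a minimal user-level sketch of the NLLLoss forward semantics (plain Python, not the CUDA kernel): with `reduction="none"` the loss is the negative log-probability of the target class for each sample.

```python
import torch
import torch.nn.functional as F

log_probs = F.log_softmax(torch.randn(4, 3), dim=1)
target = torch.tensor([0, 2, 1, 2])

# with reduction="none", nll_loss picks -log_prob of the target class per row
loss = F.nll_loss(log_probs, target, reduction="none")
expected = -log_probs[torch.arange(4), target]
assert torch.allclose(loss, expected)
```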

There are no performance differences when running the following benchmark:

<details>
 <summary>Benchmark script</summary>

```python
import torch
import torch.nn as nn
import time

torch.manual_seed(0)

def _time():
    torch.cuda.synchronize()
    MS_PER_SECOND = 1000
    return time.perf_counter() * MS_PER_SECOND

device = "cuda"
C = 30
softmax = nn.LogSoftmax(dim=1)
n_runs = 250

for reduction in ["none", "mean", "sum"]:
    for N in [100_000, 500_000, 1_000_000]:
        elapsed = 0
        for i in range(n_runs):
            data = torch.randn(N, C, device=device, requires_grad=True)
            target = torch.empty(N, dtype=torch.long, device=device).random_(0, C)
            loss = nn.NLLLoss(reduction=reduction)
            input = softmax(data)
            result = loss(input, target)

            if reduction == "none":
                gradient = torch.randn(N, device=device)
            else:
                gradient = torch.randn(1, device=device).squeeze()

            t1 = _time()
            result.backward(gradient)
            t2 = _time()
            elapsed = elapsed + (t2 - t1)
        elapsed_avg = elapsed / n_runs
        print(
            f"input size({N}, {C}), reduction: {reduction} "
            f"elapsed time is {elapsed_avg:.2f} (ms)"
        )
    print()

```

</details>

## master

```
input size(100000, 30), reduction: none elapsed time is 0.19 (ms)
input size(500000, 30), reduction: none elapsed time is 0.83 (ms)
input size(1000000, 30), reduction: none elapsed time is 1.66 (ms)

input size(100000, 30), reduction: mean elapsed time is 1.50 (ms)
input size(500000, 30), reduction: mean elapsed time is 7.19 (ms)
input size(1000000, 30), reduction: mean elapsed time is 14.35 (ms)

input size(100000, 30), reduction: sum elapsed time is 1.49 (ms)
input size(500000, 30), reduction: sum elapsed time is 7.17 (ms)
input size(1000000, 30), reduction: sum elapsed time is 14.21 (ms)
```

## this PR

```
input size(100000, 30), reduction: none elapsed time is 0.19 (ms)
input size(500000, 30), reduction: none elapsed time is 0.83 (ms)
input size(1000000, 30), reduction: none elapsed time is 1.66 (ms)

input size(100000, 30), reduction: mean elapsed time is 1.48 (ms)
input size(500000, 30), reduction: mean elapsed time is 7.16 (ms)
input size(1000000, 30), reduction: mean elapsed time is 14.29 (ms)

input size(100000, 30), reduction: sum elapsed time is 1.49 (ms)
input size(500000, 30), reduction: sum elapsed time is 7.15 (ms)
input size(1000000, 30), reduction: sum elapsed time is 14.18 (ms)
```

Pull Request resolved: #60299

Reviewed By: albanD

Differential Revision: D29287613

Pulled By: ngimel

fbshipit-source-id: 21e15f2c518087e9fb797a379e1e0a3508c98509
facebook-github-bot pushed a commit that referenced this issue Jun 23, 2021
Summary:
Ref #24507 (there doesn't seem to be an actual issue for cross).

This also moves the remaining operator functors in `THCTensorMathPointwise.cuh` to `SparseCUDATensorMath.cu`, which is the only file using them.
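
For reference, a minimal user-level sketch of what `cross` computes (plain Python, not the ported CUDA kernel): the 3-vector cross product along a chosen dimension.

```python
import torch

a = torch.randn(5, 3)
b = torch.randn(5, 3)

# cross product along dim=1 (the length-3 dimension)
c = torch.cross(a, b, dim=1)

# the same result written out component-wise
expected = torch.stack([
    a[:, 1] * b[:, 2] - a[:, 2] * b[:, 1],
    a[:, 2] * b[:, 0] - a[:, 0] * b[:, 2],
    a[:, 0] * b[:, 1] - a[:, 1] * b[:, 0],
], dim=1)
assert torch.allclose(c, expected)
```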

Pull Request resolved: #60039

Reviewed By: mrshenli

Differential Revision: D29314638

Pulled By: ngimel

fbshipit-source-id: aa7b57f6e11a933fb44f044e26945bb4a9e3de5f
facebook-github-bot pushed a commit that referenced this issue Jun 24, 2021
Summary:
Fixes #24610
Aten Umbrella issue #24507
Related to #59765

The performance does not change between this PR and master with the following benchmark script:

<details>
 <summary>Benchmark script</summary>

```python
import torch
import torch.nn as nn
import time

torch.manual_seed(0)

def _time():
    torch.cuda.synchronize()
    MS_PER_SECOND = 1000
    return time.perf_counter() * MS_PER_SECOND

device = "cuda"
C = 30
softmax = nn.LogSoftmax(dim=1)
n_runs = 250

for reduction in ["none", "mean", "sum"]:
    for N in [100_000, 500_000, 1_000_000]:
        fwd_t = 0
        bwd_t = 0
        data = torch.randn(N, C, device=device)
        target = torch.empty(N, dtype=torch.long, device=device).random_(0, C)
        loss = nn.NLLLoss(reduction=reduction)
        input = softmax(data)

        for i in range(n_runs):
            t1 = _time()
            result = loss(input, target)
            t2 = _time()
            fwd_t = fwd_t + (t2 - t1)
        fwd_avg = fwd_t / n_runs
        print(
            f"input size({N}, {C}), reduction: {reduction} "
            f"forward time is {fwd_avg:.2f} (ms)"
        )
    print()
```

</details>

## master

```
input size(100000, 30), reduction: none forward time is 0.02 (ms)
input size(500000, 30), reduction: none forward time is 0.08 (ms)
input size(1000000, 30), reduction: none forward time is 0.15 (ms)

input size(100000, 30), reduction: mean forward time is 1.81 (ms)
input size(500000, 30), reduction: mean forward time is 8.24 (ms)
input size(1000000, 30), reduction: mean forward time is 16.46 (ms)

input size(100000, 30), reduction: sum forward time is 1.66 (ms)
input size(500000, 30), reduction: sum forward time is 8.24 (ms)
input size(1000000, 30), reduction: sum forward time is 16.46 (ms)
```

## this PR

```
input size(100000, 30), reduction: none forward time is 0.02 (ms)
input size(500000, 30), reduction: none forward time is 0.08 (ms)
input size(1000000, 30), reduction: none forward time is 0.15 (ms)

input size(100000, 30), reduction: mean forward time is 1.80 (ms)
input size(500000, 30), reduction: mean forward time is 8.24 (ms)
input size(1000000, 30), reduction: mean forward time is 16.46 (ms)

input size(100000, 30), reduction: sum forward time is 1.66 (ms)
input size(500000, 30), reduction: sum forward time is 8.24 (ms)
input size(1000000, 30), reduction: sum forward time is 16.46 (ms)
```

Pull Request resolved: #60097

Reviewed By: mrshenli

Differential Revision: D29303099

Pulled By: ngimel

fbshipit-source-id: fc0d636543a79ea81158d286dcfb84043bec079a
@rgommers (Collaborator) commented:

The last open checkboxes were all done already. 280/280 ported; let's declare victory here 🎉
