Port TH operators to Aten (umbrella issue) #24507

Closed
VitalyFedyunin opened this issue Aug 16, 2019 · 7 comments
Labels
better-engineering: Relatively self-contained tasks for better engineering contributors
module: porting: Issues related to porting TH/THNN legacy to ATen native
triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@VitalyFedyunin (Contributor) commented Aug 16, 2019

Porting guide: https://github.com/pytorch/pytorch/wiki/TH-to-ATen-porting-guide

Example PR with porting of the adaptive_avg_pool2d: #14714

Example PR with porting of the point-wise operator addcmul: #22874
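
As a reminder of the semantics such a pointwise port has to preserve, here is a minimal user-level sketch (plain Python calls, not the ATen kernel itself): `addcmul` computes `input + value * tensor1 * tensor2` elementwise.

```python
import torch

input = torch.randn(3)
t1 = torch.randn(3)
t2 = torch.randn(3)

# addcmul(input, t1, t2, value=v) == input + v * t1 * t2, elementwise
out = torch.addcmul(input, t1, t2, value=0.5)
assert torch.allclose(out, input + 0.5 * t1 * t2)
```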

How to use TensorIterator (if needed): https://github.com/pytorch/pytorch/wiki/How-to-use-TensorIterator

Notes:

  • While porting operators, please include [namedtensor ci] in the commit message; it will trigger the named tensor propagation tests.

Issues were automatically generated based on the legacy:: dispatch rules in native_functions.yaml and the operator cnames in Declarations.cwrap.

CUDA Ops

CPU Ops

Pointwise bit-ops

Completed (2019-11-20)

VitalyFedyunin added the module: operators, triaged, module: porting, and better-engineering labels on Aug 16, 2019
facebook-github-bot pushed a commit that referenced this issue May 12, 2020
Summary:
Fixed #24538
Related #24507
Pull Request resolved: #37991

Differential Revision: D21531741

Pulled By: VitalyFedyunin

fbshipit-source-id: c762cc80416d7fffbb1769c6cc5e0914ceaa8e2d
facebook-github-bot pushed a commit that referenced this issue May 14, 2020
Summary:
Fixed #24559
Reference #24507
Pull Request resolved: #38373

Differential Revision: D21549626

Pulled By: ezyang

fbshipit-source-id: 84c2cf58b071df3afc312ae0aef3b5ed6c014cc7
facebook-github-bot pushed a commit that referenced this issue May 26, 2020
Summary:
References: #24521 #24522 #24547 #24548 #24507

Depends on #36308

Changes related to this PR are only in the following files:
aten/src/ATen/Declarations.cwrap
aten/src/ATen/native/cuda/ReduceOpsKernel.cu
aten/src/ATen/native/native_functions.yaml
aten/src/THC/generic/THCTensorMathScan.cu
aten/src/THC/generic/THCTensorMathScan.h

Please review, VitalyFedyunin. Thanks.
Pull Request resolved: #36458

Differential Revision: D21718384

Pulled By: ngimel

fbshipit-source-id: 5af15164050c77be164397abd659a48c9ded2b29
facebook-github-bot pushed a commit that referenced this issue Sep 25, 2020
Summary:
Related #24507
Fixes #24666

This PR modernizes the CPU implementation of the vector outer product.
The existing TH implementation behind `torch.addr` is migrated to `aten`; `torch.ger` calls into the `addr` functions to compute the outer product.
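
For context, a minimal user-level sketch (plain Python, not the migrated kernel) of the relationship described above: `ger` is the plain outer product, and it can be expressed through `addr`, which adds a scaled outer product to a matrix.

```python
import torch

x = torch.randn(3)
y = torch.randn(4)

# ger (the older name for torch.outer) computes the outer product of x and y
outer = torch.ger(x, y)

# addr(M, x, y) computes beta * M + alpha * (outer product of x and y),
# so with a zero matrix it reduces to the plain outer product
via_addr = torch.addr(torch.zeros(3, 4), x, y)
assert torch.allclose(outer, via_addr)
```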

Pull Request resolved: #44364

Reviewed By: ezyang

Differential Revision: D23866733

Pulled By: mruberry

fbshipit-source-id: 5159ea22f0e3c991123fe7c19cc9beb6ad00301e
@anjali411 (Contributor) commented:

I added it, but given that it was missing, I'm wondering if there are other ops we are missing.

Mmm, just created issues for masked_scatter and masked_fill, which were also missing from the list.

facebook-github-bot pushed a commit that referenced this issue Jan 22, 2021
Summary:
Fixes #49541

Reference: #24507

Pull Request resolved: #49732

Reviewed By: ejguan

Differential Revision: D25991438

Pulled By: ngimel

fbshipit-source-id: a43bd0bfe043d8e32a6cadbbf736a0eaa697e7ec
facebook-github-bot pushed a commit that referenced this issue Apr 3, 2021
Summary:
Fixes #24731 #24673 #24597 #24526 #46507
Related #24507

Pull Request resolved: #52043

Reviewed By: mruberry

Differential Revision: D27468266

Pulled By: ngimel

fbshipit-source-id: 35a3229c2a706da9bad4ccd0070161831e5476ba
facebook-github-bot pushed a commit that referenced this issue Jun 17, 2021
Summary:
Fixes #24618
Related to #24507
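
For context, a minimal user-level sketch of the RReLU semantics the ported kernel implements (plain Python, not the CUDA code): in eval mode negative inputs are scaled by the fixed slope `(lower + upper) / 2`, while in training mode the slope is sampled uniformly per element.

```python
import torch
import torch.nn as nn

m = nn.RReLU(lower=1/8, upper=1/3).eval()
x = torch.randn(4)

# in eval mode RReLU applies the fixed negative slope (lower + upper) / 2
slope = (1/8 + 1/3) / 2
expected = torch.where(x >= 0, x, x * slope)
assert torch.allclose(m(x), expected)
```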

<details><summary>Benchmark script:</summary>

```py
import torch
import torch.nn as nn
import time

torch.manual_seed(0)
def _time():
    torch.cuda.synchronize()
    return time.time()

device = "cuda"
m = nn.RReLU().cuda()

for n in [100, 10_000, 100_000]:
    fwd_t = 0
    bwd_t = 0
    input = torch.randn(128, n, device=device)
    grad_output = torch.ones(128, n, device=device)
    for i in range(10000):
        t1 = _time()
        output = m(input)
        t2 = _time()
        fwd_t = fwd_t + (t2 - t1)
    fwd_avg = fwd_t / 10000 * 1000
    print(f"input size(128, {n}) forward time is {fwd_avg:.2f} (ms)")
```

</details>

### Results from benchmark:

#### This PR

```
input size(128, 100) forward time is 0.01 (ms)
input size(128, 10000) forward time is 0.06 (ms)
input size(128, 100000) forward time is 0.54 (ms)
```

#### On master

```
input size(128, 100) forward time is 0.01 (ms)
input size(128, 10000) forward time is 0.08 (ms)
input size(128, 100000) forward time is 0.66 (ms)
```

Pull Request resolved: #57864

Reviewed By: H-Huang

Differential Revision: D29177169

Pulled By: ngimel

fbshipit-source-id: 4572133db06f143d27e70a91ade977ea962c8f77
facebook-github-bot pushed a commit that referenced this issue Jun 22, 2021
Summary:
Fixes #24609
Aten Umbrella issue #24507
Related to #59765
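
For context on what the ported backward has to match, here is a minimal user-level sketch of the NLLLoss forward semantics (plain Python, not the CUDA kernel): with `reduction="none"` the loss is the negative log-probability of the target class for each sample.

```python
import torch
import torch.nn.functional as F

log_probs = F.log_softmax(torch.randn(4, 3), dim=1)
target = torch.tensor([0, 2, 1, 2])

# with reduction="none", nll_loss picks -log_prob of the target class per row
loss = F.nll_loss(log_probs, target, reduction="none")
expected = -log_probs[torch.arange(4), target]
assert torch.allclose(loss, expected)
```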

There are no performance differences when running the following benchmark:

<details>
 <summary>Benchmark script</summary>

```python
import torch
import torch.nn as nn
import time

torch.manual_seed(0)

def _time():
    torch.cuda.synchronize()
    MS_PER_SECOND = 1000
    return time.perf_counter() * MS_PER_SECOND

device = "cuda"
C = 30
softmax = nn.LogSoftmax(dim=1)
n_runs = 250

for reduction in ["none", "mean", "sum"]:
    for N in [100_000, 500_000, 1_000_000]:
        elapsed = 0
        for i in range(n_runs):
            data = torch.randn(N, C, device=device, requires_grad=True)
            target = torch.empty(N, dtype=torch.long, device=device).random_(0, C)
            loss = nn.NLLLoss(reduction=reduction)
            input = softmax(data)
            result = loss(input, target)

            if reduction == "none":
                gradient = torch.randn(N, device=device)
            else:
                gradient = torch.randn(1, device=device).squeeze()

            t1 = _time()
            result.backward(gradient)
            t2 = _time()
            elapsed = elapsed + (t2 - t1)
        elapsed_avg = elapsed / n_runs
        print(
            f"input size({N}, {C}), reduction: {reduction} "
            f"elapsed time is {elapsed_avg:.2f} (ms)"
        )
    print()

```

</details>

## master

```
input size(100000, 30), reduction: none elapsed time is 0.19 (ms)
input size(500000, 30), reduction: none elapsed time is 0.83 (ms)
input size(1000000, 30), reduction: none elapsed time is 1.66 (ms)

input size(100000, 30), reduction: mean elapsed time is 1.50 (ms)
input size(500000, 30), reduction: mean elapsed time is 7.19 (ms)
input size(1000000, 30), reduction: mean elapsed time is 14.35 (ms)

input size(100000, 30), reduction: sum elapsed time is 1.49 (ms)
input size(500000, 30), reduction: sum elapsed time is 7.17 (ms)
input size(1000000, 30), reduction: sum elapsed time is 14.21 (ms)
```

## this PR

```
input size(100000, 30), reduction: none elapsed time is 0.19 (ms)
input size(500000, 30), reduction: none elapsed time is 0.83 (ms)
input size(1000000, 30), reduction: none elapsed time is 1.66 (ms)

input size(100000, 30), reduction: mean elapsed time is 1.48 (ms)
input size(500000, 30), reduction: mean elapsed time is 7.16 (ms)
input size(1000000, 30), reduction: mean elapsed time is 14.29 (ms)

input size(100000, 30), reduction: sum elapsed time is 1.49 (ms)
input size(500000, 30), reduction: sum elapsed time is 7.15 (ms)
input size(1000000, 30), reduction: sum elapsed time is 14.18 (ms)
```

Pull Request resolved: #60299

Reviewed By: albanD

Differential Revision: D29287613

Pulled By: ngimel

fbshipit-source-id: 21e15f2c518087e9fb797a379e1e0a3508c98509
facebook-github-bot pushed a commit that referenced this issue Jun 23, 2021
Summary:
Ref #24507 (there doesn't seem to be an actual issue for cross).

This also moves the remaining operator functors in `THCTensorMathPointwise.cuh` to `SparseCUDATensorMath.cu`, which is the only file using them.
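
For reference, a minimal user-level sketch of what `cross` computes (plain Python, not the ported CUDA kernel): the 3-vector cross product along a chosen dimension.

```python
import torch

a = torch.randn(5, 3)
b = torch.randn(5, 3)

# cross product along dim=1 (the length-3 dimension)
c = torch.cross(a, b, dim=1)

# the same result written out component-wise
expected = torch.stack([
    a[:, 1] * b[:, 2] - a[:, 2] * b[:, 1],
    a[:, 2] * b[:, 0] - a[:, 0] * b[:, 2],
    a[:, 0] * b[:, 1] - a[:, 1] * b[:, 0],
], dim=1)
assert torch.allclose(c, expected)
```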

Pull Request resolved: #60039

Reviewed By: mrshenli

Differential Revision: D29314638

Pulled By: ngimel

fbshipit-source-id: aa7b57f6e11a933fb44f044e26945bb4a9e3de5f
facebook-github-bot pushed a commit that referenced this issue Jun 24, 2021
Summary:
Fixes #24610
Aten Umbrella issue #24507
Related to #59765

The performance does not change between this PR and master with the following benchmark script:

<details>
 <summary>Benchmark script</summary>

```python
import torch
import torch.nn as nn
import time

torch.manual_seed(0)

def _time():
    torch.cuda.synchronize()
    MS_PER_SECOND = 1000
    return time.perf_counter() * MS_PER_SECOND

device = "cuda"
C = 30
softmax = nn.LogSoftmax(dim=1)
n_runs = 250

for reduction in ["none", "mean", "sum"]:
    for N in [100_000, 500_000, 1_000_000]:
        fwd_t = 0
        bwd_t = 0
        data = torch.randn(N, C, device=device)
        target = torch.empty(N, dtype=torch.long, device=device).random_(0, C)
        loss = nn.NLLLoss(reduction=reduction)
        input = softmax(data)

        for i in range(n_runs):
            t1 = _time()
            result = loss(input, target)
            t2 = _time()
            fwd_t = fwd_t + (t2 - t1)
        fwd_avg = fwd_t / n_runs
        print(
            f"input size({N}, {C}), reduction: {reduction} "
            f"forward time is {fwd_avg:.2f} (ms)"
        )
    print()
```

</details>

## master

```
input size(100000, 30), reduction: none forward time is 0.02 (ms)
input size(500000, 30), reduction: none forward time is 0.08 (ms)
input size(1000000, 30), reduction: none forward time is 0.15 (ms)

input size(100000, 30), reduction: mean forward time is 1.81 (ms)
input size(500000, 30), reduction: mean forward time is 8.24 (ms)
input size(1000000, 30), reduction: mean forward time is 16.46 (ms)

input size(100000, 30), reduction: sum forward time is 1.66 (ms)
input size(500000, 30), reduction: sum forward time is 8.24 (ms)
input size(1000000, 30), reduction: sum forward time is 16.46 (ms)
```

## this PR

```
input size(100000, 30), reduction: none forward time is 0.02 (ms)
input size(500000, 30), reduction: none forward time is 0.08 (ms)
input size(1000000, 30), reduction: none forward time is 0.15 (ms)

input size(100000, 30), reduction: mean forward time is 1.80 (ms)
input size(500000, 30), reduction: mean forward time is 8.24 (ms)
input size(1000000, 30), reduction: mean forward time is 16.46 (ms)

input size(100000, 30), reduction: sum forward time is 1.66 (ms)
input size(500000, 30), reduction: sum forward time is 8.24 (ms)
input size(1000000, 30), reduction: sum forward time is 16.46 (ms)
```

Pull Request resolved: #60097

Reviewed By: mrshenli

Differential Revision: D29303099

Pulled By: ngimel

fbshipit-source-id: fc0d636543a79ea81158d286dcfb84043bec079a
@rgommers (Collaborator) commented:

The last open checkboxes were all done already. 280/280 ported; let's declare victory here 🎉
