Description
For the default SSE2 implementation we extend to vXi16, perform the multiplication, and then pack the results back down to vXi8.
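For reference, a rough sketch of that widen/multiply/pack approach (the function name and exact unpack/pack sequence here are illustrative, not the exact DAG the backend builds):

#include <immintrin.h>

// Sketch: zero-extend each half to i16, multiply, keep the low byte of each product, pack back to i8.
__m128i mul_epi8_widen(__m128i x, __m128i y) {
  __m128i zero = _mm_setzero_si128();
  __m128i m = _mm_set1_epi16(255);
  __m128i lo = _mm_mullo_epi16(_mm_unpacklo_epi8(x, zero), _mm_unpacklo_epi8(y, zero));
  __m128i hi = _mm_mullo_epi16(_mm_unpackhi_epi8(x, zero), _mm_unpackhi_epi8(y, zero));
  return _mm_packus_epi16(_mm_and_si128(lo, m), _mm_and_si128(hi, m));
}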
But if we have the (v)pmaddubsw instruction, we can instead take one operand and zero out the odd bytes in one copy and the even bytes in another, perform the two pmaddubsw calls, and then shift+or the results back together, all with a single mask constant.
#include <immintrin.h>

__m128i _mm_mul_epi8(__m128i x, __m128i y) {
  __m128i m = _mm_set1_epi16(255);        // 0x00FF in each i16 lane
  __m128i ylo = _mm_and_si128(m, y);      // even bytes of y, odd bytes zeroed
  __m128i yhi = _mm_andnot_si128(m, y);   // odd bytes of y, even bytes zeroed
  __m128i lo = _mm_maddubs_epi16(x, ylo); // x[2i]*y[2i] in each i16 lane
  __m128i hi = _mm_maddubs_epi16(x, yhi); // x[2i+1]*y[2i+1] in each i16 lane
  lo = _mm_and_si128(lo, m);              // low byte of the even products
  hi = _mm_slli_epi16(hi, 8);             // low byte of the odd products, moved up
  return _mm_or_si128(lo, hi);
}
vmovaps .LCPI0_2(%rip), %xmm5
vpand %xmm2, %xmm1, %xmm3
vpandn %xmm2, %xmm1, %xmm4
vpmaddubsw %xmm3, %xmm0, %xmm3
vpmaddubsw %xmm4, %xmm0, %xmm4
vpand %xmm5, %xmm3, %xmm3
vpsllw $8, %xmm4, %xmm4
vpor %xmm4, %xmm3, %xmm4
llvm-mca analysis - https://llvm.godbolt.org/z/9361GKrds - most CPUs benefit from this, but SandyBridge appears to be more borderline; we could begin by trying this only for multiply-by-constants (and shl-by-constants?).
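If we do start with the multiply-by-constant case, the masked even/odd copies of the multiplier fold into constant pool loads, so the vpand/vpandn on the variable operand disappear entirely. Illustrative sketch only (the constant and function name are mine, not from a patch):

#include <immintrin.h>

// Multiply every i8 element by the constant 3: the even/odd masked copies of the
// multiplier are compile-time constants, so only 2*pmaddubsw + and/shift/or remain.
__m128i mul_epi8_by_3(__m128i x) {
  const __m128i m   = _mm_set1_epi16(255);    // 0x00FF repack mask
  const __m128i clo = _mm_set1_epi16(0x0003); // 3 in the even bytes, 0 in the odd bytes
  const __m128i chi = _mm_set1_epi16(0x0300); // 3 in the odd bytes, 0 in the even bytes
  __m128i lo = _mm_maddubs_epi16(x, clo);     // x[2i]*3 in each i16 lane
  __m128i hi = _mm_maddubs_epi16(x, chi);     // x[2i+1]*3 in each i16 lane
  lo = _mm_and_si128(lo, m);
  hi = _mm_slli_epi16(hi, 8);
  return _mm_or_si128(lo, hi);
}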