[X86] Investigate using (v)pmaddubsw for vXi8 multiplication on SSSE3+ targets #90748

Closed
@RKSimon

Description

For the default SSE2 implementation we extend to vXi16, perform the multiplication and pack the results back to vXi8.

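But if we have the (v)pmaddubsw instruction, we can zero out the odd or even byte of each i8 pair in one of the operands, so that each pmaddubsw produces a single i8 x i8 product per i16 lane instead of a sum of two. We then perform two pmaddubsw operations (one for the even bytes, one for the odd bytes) and shift+or the results back together, all with a single mask.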

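#include <tmmintrin.h> // SSSE3: _mm_maddubs_epi16

// vXi8 multiply built from two pmaddubsw ops (x bytes unsigned, y bytes signed).
__m128i _mm_mul_epi8(__m128i x, __m128i y) {
    __m128i m = _mm_set1_epi16(255);        // 0x00FF in every i16 lane
    __m128i ylo = _mm_and_si128(m, y);      // keep the even bytes of y
    __m128i yhi = _mm_andnot_si128(m, y);   // keep the odd bytes of y (~m & y)
    __m128i lo = _mm_maddubs_epi16(x, ylo); // even-byte products as i16
    __m128i hi = _mm_maddubs_epi16(x, yhi); // odd-byte products as i16
    lo = _mm_and_si128(lo, m);              // low 8 bits of the even products
    hi = _mm_slli_epi16(hi, 8);             // low 8 bits of the odd products, moved to the high byte
    return _mm_or_si128(lo, hi);
}

  vmovaps .LCPI0_2(%rip), %xmm5
  vpand %xmm2, %xmm1, %xmm3
  vpandn %xmm2, %xmm1, %xmm4
  vpmaddubsw %xmm3, %xmm0, %xmm3
  vpmaddubsw %xmm4, %xmm0, %xmm4
  vpand %xmm5, %xmm3, %xmm3
  vpsllw $8, %xmm4, %xmm4
  vpor %xmm4, %xmm3, %xmm4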
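
For anyone checking the arithmetic: below is a rough scalar model of a single i16 lane of the sequence above. It is a sketch for reasoning only, not the proposed lowering. The helper names (`maddubs_lane`, `mul_epi8_lane`, `count_mismatches`) are made up for illustration. The point it tries to show is that pmaddubsw saturates the sum of two products, but once one byte of each pair in y is zeroed the single remaining product lies in [-32640, 32385], so saturation cannot trigger and the low 8 bits match a wrapping i8 multiply.

```c
#include <stdint.h>

/* Scalar model of one 16-bit lane of pmaddubsw: bytes of `a` are unsigned,
   bytes of `b` are signed, and the sum of the two products is saturated to
   the int16_t range. */
static int16_t maddubs_lane(uint8_t a0, uint8_t a1, int8_t b0, int8_t b1) {
    int32_t sum = (int32_t)a0 * b0 + (int32_t)a1 * b1;
    if (sum > INT16_MAX) sum = INT16_MAX;
    if (sum < INT16_MIN) sum = INT16_MIN;
    return (int16_t)sum;
}

/* Models one i16 lane (two adjacent i8 elements) of the masked sequence.
   x0/y0 are the even (low) bytes, x1/y1 the odd (high) bytes. The result
   holds x0*y0 in its low byte and x1*y1 in its high byte, both mod 256. */
static uint16_t mul_epi8_lane(uint8_t x0, uint8_t x1, uint8_t y0, uint8_t y1) {
    /* ylo = y & 0x00FF keeps y0 and zeroes y1; yhi = ~0x00FF & y keeps y1. */
    int16_t lo = maddubs_lane(x0, x1, (int8_t)y0, 0);
    int16_t hi = maddubs_lane(x0, x1, 0, (int8_t)y1);
    uint16_t lo_bits = (uint16_t)lo & 0x00FF;          /* pand with the mask */
    uint16_t hi_bits = (uint16_t)((uint16_t)hi << 8);  /* psllw $8 */
    return (uint16_t)(lo_bits | hi_bits);
}

/* Exhaustively compares both byte positions against a wrapping 8-bit
   multiply and returns the number of mismatches. */
static unsigned count_mismatches(void) {
    unsigned bad = 0;
    for (unsigned x = 0; x < 256; ++x) {
        for (unsigned y = 0; y < 256; ++y) {
            uint8_t want = (uint8_t)(x * y);
            uint16_t lane = mul_epi8_lane((uint8_t)x, (uint8_t)x,
                                          (uint8_t)y, (uint8_t)y);
            if ((lane & 0xFF) != want) ++bad;
            if ((lane >> 8) != want) ++bad;
        }
    }
    return bad;
}
```

Calling `count_mismatches()` walks all 65536 input pairs for both byte positions. Without the masking step, `maddubs_lane(255, 255, 127, 127)` would saturate, which is why a plain pmaddubsw on the raw operands is not enough.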

llvm-mca analysis - https://llvm.godbolt.org/z/9361GKrds - most CPUs benefit from this, but SandyBridge appears to be more borderline. It could be that we begin by trying this only for multiply-by-constants (and shl-by-constants?).
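
On the shl-by-constant idea, one possible reading is that an i8 shift left by k gets treated as a multiply by the constant (1 << k), which would then take the same pmaddubsw path. A small scalar sketch of why that should still work when the constant sits in the signed operand, including k = 7 where the byte 0x80 is read as -128. The function names are hypothetical and this models a single element only:

```c
#include <stdint.h>

/* Shift-left of an i8 element expressed as unsigned x times a signed
   constant, as the signed operand of pmaddubsw would see it. */
static uint8_t shl_via_signed_mul(uint8_t x, unsigned k) {
    int8_t c = (int8_t)(uint8_t)(1u << k);  /* k == 7 yields -128 */
    int16_t prod = (int16_t)((int)x * c);   /* |prod| <= 255 * 128, fits i16 */
    return (uint8_t)prod;                   /* keep the low 8 bits */
}

/* Compares against a plain wrapping shift for every x and every k in 0..7;
   returns the number of mismatches. */
static unsigned shl_mismatches(void) {
    unsigned bad = 0;
    for (unsigned k = 0; k < 8; ++k)
        for (unsigned x = 0; x < 256; ++x)
            if (shl_via_signed_mul((uint8_t)x, k) != (uint8_t)(x << k))
                ++bad;
    return bad;
}
```

With a constant second operand the two masked vectors (ylo/yhi in the snippet above) would presumably be constants as well, so the pand/pandn on y should fold away.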
