Description
For the default SSE2 implementation we extend to vXi16, perform the multiplication, and then pack the results back down to vXi8.
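For reference, a rough sketch of that widen/multiply/pack approach (the function name and exact unpack/pack sequence here are illustrative, not the exact DAG the backend builds):

#include <immintrin.h>

// Sketch: zero-extend each half to i16, multiply, keep the low byte of each product, pack back to i8.
__m128i mul_epi8_widen(__m128i x, __m128i y) {
  __m128i zero = _mm_setzero_si128();
  __m128i m = _mm_set1_epi16(255);
  __m128i lo = _mm_mullo_epi16(_mm_unpacklo_epi8(x, zero), _mm_unpacklo_epi8(y, zero));
  __m128i hi = _mm_mullo_epi16(_mm_unpackhi_epi8(x, zero), _mm_unpackhi_epi8(y, zero));
  return _mm_packus_epi16(_mm_and_si128(lo, m), _mm_and_si128(hi, m));
}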
But if we have the (v)pmaddubsw instruction, we can instead take one operand and zero out the odd bytes in one copy and the even bytes in another, perform the two pmaddubsw calls, and then shift+or the results back together, all with a single mask constant.
#include <immintrin.h>

__m128i _mm_mul_epi8(__m128i x, __m128i y) {
  __m128i m = _mm_set1_epi16(255);        // 0x00FF in each i16 lane
  __m128i ylo = _mm_and_si128(m, y);      // even bytes of y, odd bytes zeroed
  __m128i yhi = _mm_andnot_si128(m, y);   // odd bytes of y, even bytes zeroed
  __m128i lo = _mm_maddubs_epi16(x, ylo); // x[2i]*y[2i] in each i16 lane
  __m128i hi = _mm_maddubs_epi16(x, yhi); // x[2i+1]*y[2i+1] in each i16 lane
  lo = _mm_and_si128(lo, m);              // low byte of the even products
  hi = _mm_slli_epi16(hi, 8);             // low byte of the odd products, moved up
  return _mm_or_si128(lo, hi);
}
vmovaps .LCPI0_2(%rip), %xmm5
vpand %xmm2, %xmm1, %xmm3
vpandn %xmm2, %xmm1, %xmm4
vpmaddubsw %xmm3, %xmm0, %xmm3
vpmaddubsw %xmm4, %xmm0, %xmm4
vpand %xmm5, %xmm3, %xmm3
vpsllw $8, %xmm4, %xmm4
vpor %xmm4, %xmm3, %xmm4
llvm-mca analysis - https://llvm.godbolt.org/z/9361GKrds - most CPUs benefit from this, but SandyBridge appears to be more borderline; we could begin by trying this only for multiply-by-constants (and shl-by-constants?).
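If we do start with the multiply-by-constant case, the masked even/odd copies of the multiplier fold into constant pool loads, so the vpand/vpandn on the variable operand disappear entirely. Illustrative sketch only (the constant and function name are mine, not from a patch):

#include <immintrin.h>

// Multiply every i8 element by the constant 3: the even/odd masked copies of the
// multiplier are compile-time constants, so only 2*pmaddubsw + and/shift/or remain.
__m128i mul_epi8_by_3(__m128i x) {
  const __m128i m   = _mm_set1_epi16(255);    // 0x00FF repack mask
  const __m128i clo = _mm_set1_epi16(0x0003); // 3 in the even bytes, 0 in the odd bytes
  const __m128i chi = _mm_set1_epi16(0x0300); // 3 in the odd bytes, 0 in the even bytes
  __m128i lo = _mm_maddubs_epi16(x, clo);     // x[2i]*3 in each i16 lane
  __m128i hi = _mm_maddubs_epi16(x, chi);     // x[2i+1]*3 in each i16 lane
  lo = _mm_and_si128(lo, m);
  hi = _mm_slli_epi16(hi, 8);
  return _mm_or_si128(lo, hi);
}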