How to square two complex doubles with 256-bit AVX vectors?

By : sacheie

Matt Scarpino gives a good explanation (although he admits he's not sure it's the optimal algorithm, I offer him my gratitude) for how to multiply two complex doubles with Intel's AVX intrinsics. Here's his method, which I've verified:

__m256d vec1 = _mm256_setr_pd(4.0, 5.0, 13.0, 6.0);
__m256d vec2 = _mm256_setr_pd(9.0, 3.0, 6.0, 7.0);
__m256d neg  = _mm256_setr_pd(1.0, -1.0, 1.0, -1.0);

/* Step 1: Multiply vec1 and vec2 */
__m256d vec3 = _mm256_mul_pd(vec1, vec2);

/* Step 2: Switch the real and imaginary elements of vec2 */
vec2 = _mm256_permute_pd(vec2, 0x5);

/* Step 3: Negate the imaginary elements of vec2 */
vec2 = _mm256_mul_pd(vec2, neg);  

/* Step 4: Multiply vec1 and the modified vec2 */
__m256d vec4 = _mm256_mul_pd(vec1, vec2);

/* Horizontally subtract the elements in vec3 and vec4 */
vec1 = _mm256_hsub_pd(vec3, vec4);

/* Display the elements of the result vector */
double* res = (double*)&vec1;
printf("%lf %lf %lf %lf\n", res[0], res[1], res[2], res[3]);

My problem is that I want to square two complex doubles. I tried to use Matt's technique like so:

struct cmplx a;
struct cmplx b;

a.r = 2.5341;
a.i = 1.843;

b.r = 1.3941;
b.i = 0.93;

__m256d zzs = squareZ(a, b);

double* res = (double*) &zzs;

printf("\nA: %f + %f,  B: %f + %f\n", res[0], res[1], res[2], res[3]);

Using Haskell's complex arithmetic, I have verified the results are correct except, as you can see, the real part of B:

A: 3.025014 + 9.340693,  B: 0.000000 + 2.593026

So I have two questions really: is there a better (simpler and/or faster) way to square two complex doubles with AVX intrinsics? If not, how can I modify Matt's code to do it?

By : sacheie


This answer covers the general case of multiplying two arrays of complex numbers

Ideally, store your data in separate real and imaginary arrays, so you can just load contiguous vectors of real and imaginary parts. That makes it free to do the cross-multiplying (just use different registers / variables) instead of having to shuffle things around within a vector.

You can convert between interleaved double complex style and SIMD-friendly separate-vectors style on the fly fairly cheaply, subject to the vagaries of AVX in-lane shuffles. e.g. very cheaply with unpacklo / unpackhi shuffles to de-interleave or to re-interleave within a lane, if you don't care about the actual order of the data within the temporary vector.

It's actually so cheap to do this shuffle that doing it on the fly for a single complex multiply comes out somewhat ahead of (even a tweaked version of) Matt's code, especially on CPUs that support FMA. This requires producing results in groups of 4 complex doubles (2 result vectors).

If you need to produce only one result vector at a time, I also came up with an alternative to Matt's algorithm that can use FMA (actually FMADDSUB) and avoid the separate sign-change insn.

gcc auto-vectorizes simple complex multiply scalar loop to pretty good code, as long as you use -ffast-math. It deinterleaves like I suggested.


See my other answer for the general case of multiplying different complex numbers, not squaring.

TL:DR: just use the code in my other answer with both inputs the same. Compilers do a good job with the redundancy.

Squaring simplifies the math slightly: instead of needing two different cross products, rAiB and rBiA are the same. But it still needs to get doubled, so basically we end up with 2 mul 1 FMA 1 add, instead of 2 mul 2 FMA.

With the SIMD-unfriendly interleaved storage format, it gives a big boost to the deinterleave method, since there's only one input to shuffle. Matt's method doesn't benefit at all, since it calculates both cross products with the same vector multiply.

Using the cmul_manualvec() from my other answer:

// squares 4 complex doubles from A[0..3], storing the result in dst[0..3]
void csquare_manual(double complex *restrict dst,
          const double complex *restrict A) {
  cmul_manualvec(dst, A, A);

gcc and clang are smart enough to optimize away the redundancy of using the same input twice, so there's no need to make a custom version with intrinsics. clang does a bad job on the scalar auto-vectorizing version, so don't use that. I don't see anything to be gained over this asm output By : Peter Cordes

This video can help you solving your question :)
By: admin