Matt Scarpino gives a good explanation (although he admits he's not sure it's the optimal algorithm, I offer him my gratitude) for how to multiply two complex doubles with Intel's AVX intrinsics. Here's his method, which I've verified:
__m256d vec1 = _mm256_setr_pd(4.0, 5.0, 13.0, 6.0); __m256d vec2 = _mm256_setr_pd(9.0, 3.0, 6.0, 7.0); __m256d neg = _mm256_setr_pd(1.0, -1.0, 1.0, -1.0); /* Step 1: Multiply vec1 and vec2 */ __m256d vec3 = _mm256_mul_pd(vec1, vec2); /* Step 2: Switch the real and imaginary elements of vec2 */ vec2 = _mm256_permute_pd(vec2, 0x5); /* Step 3: Negate the imaginary elements of vec2 */ vec2 = _mm256_mul_pd(vec2, neg); /* Step 4: Multiply vec1 and the modified vec2 */ __m256d vec4 = _mm256_mul_pd(vec1, vec2); /* Horizontally subtract the elements in vec3 and vec4 */ vec1 = _mm256_hsub_pd(vec3, vec4); /* Display the elements of the result vector */ double* res = (double*)&vec1; printf("%lf %lf %lf %lf\n", res, res, res, res);
My problem is that I want to square two complex doubles. I tried to use Matt's technique like so:
struct cmplx a; struct cmplx b; a.r = 2.5341; a.i = 1.843; b.r = 1.3941; b.i = 0.93; __m256d zzs = squareZ(a, b); double* res = (double*) &zzs; printf("\nA: %f + %f, B: %f + %f\n", res, res, res, res);
Using Haskell's complex arithmetic, I have verified the results are correct except, as you can see, the real part of B:
A: 3.025014 + 9.340693, B: 0.000000 + 2.593026
So I have two questions really: is there a better (simpler and/or faster) way to square two complex doubles with AVX intrinsics? If not, how can I modify Matt's code to do it?