I am optimizing some code for an Intel x86 Nehalem micro-architecture using SSE intrinsics.
A portion of my program computes 4 dot products and adds each result to the previous values in a contiguous chunk of an array. More specifically,
    tmp0 = _mm_dp_ps(A_0m, B_0m, 0xF1);
    tmp1 = _mm_dp_ps(A_1m, B_0m, 0xF2);
    tmp2 = _mm_dp_ps(A_2m, B_0m, 0xF4);
    tmp3 = _mm_dp_ps(A_3m, B_0m, 0xF8);

    tmp0 = _mm_add_ps(tmp0, tmp1);
    tmp0 = _mm_add_ps(tmp0, tmp2);
    tmp0 = _mm_add_ps(tmp0, tmp3);
    tmp0 = _mm_add_ps(tmp0, C_0n);

    _mm_storeu_ps(C_2, tmp0);
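Notice that I use 4 temporary xmm registers, one to hold the result of each dot product. The low nibble of each mask (0x1, 0x2, 0x4, 0x8) places the result in a different 32-bit element of each register and zeroes the other elements, so the end result looks like this (elements listed from element 0, the lowest 32 bits, upward; dp0 to dp3 are the four dot products):

    tmp0 = (dp0, 0,   0,   0  )
    tmp1 = (0,   dp1, 0,   0  )
    tmp2 = (0,   0,   dp2, 0  )
    tmp3 = (0,   0,   0,   dp3)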
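For reference, here is a minimal sketch that checks this element placement in isolation. The function name `dp_lane_demo` and its parameters are made up for illustration and are not part of my program; the `target` attribute assumes GCC or Clang (with other compilers, enable SSE4.1 through the usual compiler option instead).

```c
#include <smmintrin.h>  /* SSE4.1: _mm_dp_ps */

/* Hypothetical helper: computes the same dot product four times, each time
   with a different low mask nibble, and stores the four resulting vectors
   so the element placement can be inspected. */
__attribute__((target("sse4.1")))  /* assumption: GCC/Clang */
void dp_lane_demo(const float *a, const float *b, float out[4][4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);

    /* High nibble 0xF: all four element pairs take part in the product.
       Low nibble: which output element receives the sum; the rest are 0. */
    _mm_storeu_ps(out[0], _mm_dp_ps(va, vb, 0xF1));  /* sum in element 0 */
    _mm_storeu_ps(out[1], _mm_dp_ps(va, vb, 0xF2));  /* sum in element 1 */
    _mm_storeu_ps(out[2], _mm_dp_ps(va, vb, 0xF4));  /* sum in element 2 */
    _mm_storeu_ps(out[3], _mm_dp_ps(va, vb, 0xF8));  /* sum in element 3 */
}
```

With a = (1, 2, 3, 4) and b = (5, 6, 7, 8), each `out[i]` should contain the dot product in element `i` and zeros elsewhere.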
I combine the values contained in each tmp variable into one xmm variable by summing them up with the following instructions:
    tmp0 = _mm_add_ps(tmp0, tmp1);
    tmp0 = _mm_add_ps(tmp0, tmp2);
    tmp0 = _mm_add_ps(tmp0, tmp3);
Finally, I add the register containing all 4 dot-product results to a contiguous part of an array, so that each of the 4 array elements is incremented by its corresponding dot product (C_0n holds the 4 values currently in the array that are to be updated; C_2 is the address of these 4 values):
    tmp0 = _mm_add_ps(tmp0, C_0n);
    _mm_storeu_ps(C_2, tmp0);
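To make the question easier to reproduce, here is a self-contained sketch of the whole sequence wrapped in a function. The wrapper name `accumulate_4dp`, its pointer parameters, and the unaligned loads are assumptions added for illustration; in my real code the registers are already loaded. The `target` attribute again assumes GCC or Clang.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_dp_ps */

/* Hypothetical wrapper: a0..a3 and b each point to 4 floats; c points to
   the 4 array values that are updated in place (C_2 in the question). */
__attribute__((target("sse4.1")))  /* assumption: GCC/Clang */
void accumulate_4dp(const float *a0, const float *a1,
                    const float *a2, const float *a3,
                    const float *b, float *c)
{
    __m128 B_0m = _mm_loadu_ps(b);
    __m128 C_0n = _mm_loadu_ps(c);  /* current contents of the array chunk */

    /* One dot product per register, each in its own element. */
    __m128 tmp0 = _mm_dp_ps(_mm_loadu_ps(a0), B_0m, 0xF1);
    __m128 tmp1 = _mm_dp_ps(_mm_loadu_ps(a1), B_0m, 0xF2);
    __m128 tmp2 = _mm_dp_ps(_mm_loadu_ps(a2), B_0m, 0xF4);
    __m128 tmp3 = _mm_dp_ps(_mm_loadu_ps(a3), B_0m, 0xF8);

    /* Merge the four single-element results, then add the old values. */
    tmp0 = _mm_add_ps(tmp0, tmp1);
    tmp0 = _mm_add_ps(tmp0, tmp2);
    tmp0 = _mm_add_ps(tmp0, tmp3);
    tmp0 = _mm_add_ps(tmp0, C_0n);

    _mm_storeu_ps(c, tmp0);
}
```

For example, if a0..a3 are the rows of the 4x4 identity matrix, b = (1, 2, 3, 4) and c = (10, 20, 30, 40), then c becomes (11, 22, 33, 44) after the call.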
I want to know if there is a less roundabout, more efficient way to take the results of the dot products and add them to the contiguous chunk of the array. As written, I am doing 3 additions between registers that each have only 1 non-zero value in them. It seems there should be a more efficient way to go about this.
I appreciate all help. Thank you.