## How to make the following code faster

By : anup
Source: Stackoverflow.com
Question!
```
#include <emmintrin.h>  /* SSE2 intrinsics */

int i, k, l;
int u1, u2;
__m128i simda, simdb;
/* unsigned long is 64 bits wide here */
unsigned long elm1[20], _mulpre[16][20], res1[40], res2[40];
/* res1 and res2 are initialized to zero */

l = 60;
while (l)
{
    for (i = 0; i < 20; i += 2)
    {
        u1 = (elm1[i] >> l) & 15;
        u2 = (elm1[i + 1] >> l) & 15;  /* computed but not used below, as posted */

        for (k = 0; k < 20; k += 2)
        {
            /* XOR one 128-bit chunk of the table row into res1 */
            simda = _mm_load_si128 ((__m128i *) &_mulpre[u1][k]);
            simdb = _mm_load_si128 ((__m128i *) &res1[i + k]);
            simdb = _mm_xor_si128  (simda, simdb);
            _mm_store_si128 ((__m128i *)&res1[i + k], simdb);

            /* XOR the same chunk into res2 */
            simdb = _mm_load_si128 ((__m128i *)&res2[i + k]);
            simdb = _mm_xor_si128  (simda, simdb);
            _mm_store_si128 ((__m128i *)&res2[i + k], simdb);
        }
    }
    l -= 4;
    /* all res1 and res2 values are left-shifted by 4 bits here */
}
```

The code above is called many times in my program (the profiler shows it at 98%).

EDIT: In the inner loop, the res1[i + k] values are loaded many times for the same (i + k). I tried the following: inside the while loop, I loaded all the res1 values into an array of SIMD registers and used those array elements in the innermost for loop. Once both for loops were done, I stored the array values back to res1 and res2. But the computation time increased with this. Any idea where I went wrong? The idea seemed correct.

Any suggestion to make it faster is welcome.

By : anup
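The approach described in the EDIT can be sketched as follows, next to the in-memory version it was meant to replace. This is a minimal sketch under stated assumptions, not the asker's actual code: the names `pass_in_memory`, `pass_local_copy`, `WORDS`, `TABLE` and `RES` are hypothetical, `uint64_t` stands in for the 64-bit `unsigned long`, only the `res1` update is shown, and all arrays are assumed to be 16-byte aligned. One plausible reason the local-array version is not faster: x86-64 exposes 16 XMM registers, the accumulator array needs 20, and it is indexed at run time, so a compiler will typically keep it on the stack. The loads and stores then still happen, plus the copy in and out. Checking the generated assembly would confirm or refute this.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Sizes taken from the question; the names are hypothetical. */
enum { WORDS = 20, TABLE = 16, RES = 2 * WORDS };

/* One pass over i for a fixed shift l, updating res1 in memory. */
void pass_in_memory(uint64_t *res1, const uint64_t *elm1,
                    uint64_t (*mulpre)[WORDS], int l)
{
    assert(((uintptr_t)res1 & 15) == 0); /* aligned loads need 16-byte alignment */
    for (int i = 0; i < WORDS; i += 2) {
        unsigned u1 = (unsigned)((elm1[i] >> l) & 15);
        for (int k = 0; k < WORDS; k += 2) {
            __m128i a = _mm_load_si128((const __m128i *)&mulpre[u1][k]);
            __m128i r = _mm_load_si128((const __m128i *)&res1[i + k]);
            _mm_store_si128((__m128i *)&res1[i + k], _mm_xor_si128(a, r));
        }
    }
}

/* Same pass, accumulating in a local __m128i array and writing back once. */
void pass_local_copy(uint64_t *res1, const uint64_t *elm1,
                     uint64_t (*mulpre)[WORDS], int l)
{
    __m128i acc[RES / 2]; /* 20 vectors: more than the 16 XMM registers */
    assert(((uintptr_t)res1 & 15) == 0);
    for (int j = 0; j < RES / 2; j++)
        acc[j] = _mm_load_si128((const __m128i *)&res1[2 * j]);
    for (int i = 0; i < WORDS; i += 2) {
        unsigned u1 = (unsigned)((elm1[i] >> l) & 15);
        for (int k = 0; k < WORDS; k += 2) {
            __m128i a = _mm_load_si128((const __m128i *)&mulpre[u1][k]);
            /* (i + k) is even, so (i + k) / 2 selects the matching vector */
            acc[(i + k) / 2] = _mm_xor_si128(acc[(i + k) / 2], a);
        }
    }
    for (int j = 0; j < RES / 2; j++)
        _mm_store_si128((__m128i *)&res1[2 * j], acc[j]);
}
```

Both functions should produce identical `res1` contents for the same inputs, so either can be timed in place of the other inside the `while (l)` loop.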

Well, you could always call it fewer times :-)

The total input

There is very little you can do with a routine such as this, since loads and stores will be the dominant factor (you're doing 2 loads + 1 store = 4 bus cycles for a single computational instruction).

By : Paul R
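To illustrate the load/store ratio mentioned in the answer above, here is a hedged sketch of one way to change it: fusing two consecutive row updates so that the overlapping part of the destination is loaded and stored once instead of twice. This is not something proposed in the thread. The function names `xor_row` and `xor_two_rows` are hypothetical, `uint64_t` stands in for the 64-bit `unsigned long`, and all pointers are assumed to be 16-byte aligned. Whether it is actually faster has to be measured.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

enum { WORDS = 20 }; /* row length from the question */

/* Baseline: dst[0..WORDS-1] ^= row. Each XOR costs 2 loads and 1 store. */
void xor_row(uint64_t *dst, const uint64_t *row)
{
    assert(((uintptr_t)dst & 15) == 0 && ((uintptr_t)row & 15) == 0);
    for (int k = 0; k < WORDS; k += 2) {
        __m128i a = _mm_load_si128((const __m128i *)&row[k]);
        __m128i d = _mm_load_si128((const __m128i *)&dst[k]);
        _mm_store_si128((__m128i *)&dst[k], _mm_xor_si128(a, d));
    }
}

/* Fused: same result as xor_row(dst, row_a); xor_row(dst + 2, row_b);
   dst must hold WORDS + 2 elements. In the overlap, two XORs cost
   3 loads and 1 store instead of 4 loads and 2 stores. */
void xor_two_rows(uint64_t *dst, const uint64_t *row_a, const uint64_t *row_b)
{
    assert(((uintptr_t)dst & 15) == 0);
    __m128i d, a, b;

    /* First chunk: only row_a contributes. */
    d = _mm_load_si128((const __m128i *)&dst[0]);
    a = _mm_load_si128((const __m128i *)&row_a[0]);
    _mm_store_si128((__m128i *)&dst[0], _mm_xor_si128(d, a));

    /* Overlap: row_a[k] and row_b[k - 2] both land on dst[k]. */
    for (int k = 2; k < WORDS; k += 2) {
        d = _mm_load_si128((const __m128i *)&dst[k]);
        a = _mm_load_si128((const __m128i *)&row_a[k]);
        b = _mm_load_si128((const __m128i *)&row_b[k - 2]);
        _mm_store_si128((__m128i *)&dst[k],
                        _mm_xor_si128(d, _mm_xor_si128(a, b)));
    }

    /* Last chunk: only row_b contributes. */
    d = _mm_load_si128((const __m128i *)&dst[WORDS]);
    b = _mm_load_si128((const __m128i *)&row_b[WORDS - 2]);
    _mm_store_si128((__m128i *)&dst[WORDS], _mm_xor_si128(d, b));
}
```

In the question's loop, this would correspond to handling `i` and `i + 2` together, with `row_a = _mulpre[u(i)]`, `row_b = _mulpre[u(i + 2)]` and `dst = &res1[i]`, so `i` would advance by 4 per iteration.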

```
l = 60;
while (l)
{
    for (i = 0; i
```
By : prgbenz