## SIMD code for exponentiation

By : anup
Source: Stackoverflow.com
Question!

I am using SIMD to compute fast exponentiation result. I compare the timing with non-simd code. The exponentiation is implemented using square and multiply algorithm.

Ordinary(non-simd) version of code:

``````b = 1;
for (i=WPE-1; i>=0; --i){
ew = e[i];
for(j=0; j<BPW; ++j){
b = (b * b) % p;
if (ew & 0x80000000U)  b = (b * a) % p;
ew <<= 1;
}
}
``````

SIMD version:

``````   B.data[0] = B.data[1] = B.data[2] = B.data[3] = 1U;
P.data[0] = P.data[1] = P.data[2] = P.data[3] = p;
for (i=WPE-1; i>=0; --i) {
EW.data[0] = e1[i]; EW.data[1] = e2[i]; EW.data[2] = e3[i]; EW.data[3] = e4[i];
for (j=0; j<BPW;++j){
B.v *= B.v; B.v -= (B.v / P.v) * P.v;
EWV.v = _mm_srli_epi32(EW.v,31);
M.data[0] = (EWV.data[0]) ? a1 : 1U;
M.data[1] = (EWV.data[1]) ? a2 : 1U;
M.data[2] = (EWV.data[2]) ? a3 : 1U;
M.data[3] = (EWV.data[3]) ? a4 : 1U;
B.v *= M.v; B.v -= (B.v / P.v) * P.v;
EW.v = _mm_slli_epi32(EW.v,1);
}
}
``````

The issue is though it is computing correctly, simd version is taking more time than non-simd version.

Thanks & regards, Anup.

By : anup

All functions in the for loops should be SIMD functions, not only two. Time taking to set the arguments for your 2 functions is less optimal then your original example (which is most likely optimized by the compiler)

A SIMD loop for 32 bit int data typically looks something like this:

``````for (i = 0; i < N; i += 4)
{
// load input vector(s) with data at array index i..i+3

// process vectors using SIMD instructions (i.e. no scalar code)

// store result vector(s) at array index i..i+3
_mm_store_si128(&C[i], vc);
}
``````

If you find that you need to move between scalar code and SIMD code within the loop then you probably won't gain anything from SIMD optimisation.

Much of the skill in SIMD programming comes from finding ways to make your algorithm work with the limited number of supported instructions and data types that a given SIMD architecture provides. You will often need to exploit a priori knowledge of your data set to get the best possible performance, e.g. if you know for certain that your 32 bit integer values actually have a range that fits within 16 bits then that would make the multiplication part of your algorithm a lot easier to implement.

By : Paul R