How to use the multiply and accumulate intrinsics in ARM Cortex-a8?


how to use the Multiply-Accumulate intrinsics provided by GCC?

float32x4_t vmlaq_f32 (float32x4_t , float32x4_t , float32x4_t);

Can anyone explain what three parameters I have to pass to this function. I mean the Source and destination registers and what the function returns?


result = vml (matrix[0], vector);
result = vmla (result, matrix[1], vector);
result = vmla (result, matrix[2], vector);
result = vmla (result, matrix[3], vector);

This sequence won't work, though. The problem is that x component accumulates only x modulated by the matrix rows and can be expressed as:

result.x = vector.x * (matrix[0][0]   matrix[1][0]   matrix[2][0]   matrix[3][0]);


The correct sequence would be:

result = vml (matrix[0], vector.xxxx);
result = vmla(result, matrix[1], vector.yyyy);


NEON and SSE don't have built-in selection for the fields (this would require 8 bits in instruction incoding, per vector register). GLSL/HLSL for example does have this kind of facilities so most GPUs have also.

Alternative way to achieve this would be:

result.x = dp4(vector, matrix[0]);
result.y = dp4(vector, matrix[1]);

... // and of course, the matrix would be transpose for this to yield same result

The mul,madd,madd,madd sequence is usually preferred as it does not require write mask for the target register fields.

Otherwise the code looks good. =)

By : gpudude

This video can help you solving your question :)
By: admin