What is the limit of optimization using SIMD?

Tags: c simd
By : flow
Source: Stackoverflow.com

I need to optimize some C code, which does lots of physics computations, using SIMD extensions on the SPE of the Cell Processor. Each vector operator can process 4 floats at the same time. So ideally I would expect a 4x speedup in the most optimistic case.

Do you think the use of vector operators could give bigger speedups?


By : flow


It depends on the architecture.. For the moment I assume x86 architecture (aka SSE).

You can get factor four on tight loops easily. Just replace your existing math with SSE instruction and you're done.

You can even get a little more than that because if you use SSE you do the math in registers which are usually not used by the compiler. This frees up the general purpose register for other task such as loop control and address calculation. In short the code that surrounds the SSE instruction will be more compact and execute faster.

And then there is the option to hint the memory controller how you want to access the memory, e.g. if you want to store data in a way that it bypasses the cache or not. For bandwidth hungry algorithms that may give you some more extra speed ontop of that.

This is entirely possible.

  • You can do more clever instruction-level micro optimizations than a compiler, if you know what you're doing.
  • Most SIMD instruction sets offers several powerful operations that don't have any equivalent in normal scalar FPU/ALU code (e.g. PAVG/PMIN etc. in SSE2). Even if these don't fit your problem exactly, you can often combine these instructions for great effect.
  • Not sure about Cell, but most SIMD instruction sets have features to optimize memory access, for example to prefetch data into cache. I've had very good results with these.

Now this isn't Cell or PPC at all, but a simple image convolution filter of mine got a 20x speedup (C vs. SSE2) on Atom, which is higher than the level of parallelity (16 pixels at a time).

By : dietr

On their own, no. But if the process of re-writing your algorithms to support them also happens to improve, say, cache locality or branching behaviour, then you could find unrelated speed-ups. However, this is true of any re-write...

This video can help you solving your question :)
By: admin