## iPhone ARMv6 VFP asm latency, throughput and hazards

By : genesys
Source: Stackoverflow.com
Question!

On page 21-25 (PDF page 875), the throughput and latency timings are given for the assembly instructions of the VFP unit.

Are those numbers independent of vector size?

1: Let's take FMULS, which has a throughput of 1 and a latency of 8. Does that mean I can start a new FMULS operation every cycle, as long as it doesn't use a register that is still being computed by a previous instruction? For example:

```
FMULS s8, s16, s20
FMULS s12, s21, s25
```

Will those execute right after each other?

2: What happens if I have two FMULS instructions in a row, where one argument of the second depends on the result of the first?

```
FMULS s8, s16, s20
FMULS s12, s21, s8
```

Will the VFP wait for 8 cycles before starting to process the second instruction?

3: What if we are in vector mode with 4 elements and, on the second FMULS instruction, all input registers but one are available? What will happen?

4: Sqrt and division: will a sqrt or division operation prevent any subsequent operation from being started for 19 cycles?

thanks!

By : genesys

Are those numbers independent of vector size?

No. See, for example, Table 21-15 in the document you linked. Note the latency of the short vector `FADDS`.

does it mean that I can start a new `FMULS` operation every cycle if it doesn't depend on an earlier result that isn't available yet?

Yes, that's the definition of throughput.
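As a back-of-the-envelope check, the two figures combine like this: independent instructions issue one per `throughput` cycles, and each result arrives `latency` cycles after its own issue. The sketch below assumes an idealized pipeline with no other stalls; the function name `cycles_for_independent_ops` is made up for illustration, and the defaults 1 and 8 are the FMULS figures quoted in the question.

```python
def cycles_for_independent_ops(n, throughput=1, latency=8):
    """Cycles until the last of n independent ops has its result ready.

    Idealized model (an assumption, not a cycle-accurate VFP description):
    one op issues every `throughput` cycles starting at cycle 0, and each
    result is ready `latency` cycles after that op issued.
    """
    if n <= 0:
        return 0
    last_issue = (n - 1) * throughput  # issue cycle of the final op
    return last_issue + latency


print(cycles_for_independent_ops(1))  # 8
print(cycles_for_independent_ops(4))  # 11
```

Under these assumptions, four independent multiplies finish in 11 cycles rather than 32, which is why keeping several independent operations in flight matters so much on a unit with long latency.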

What happens if I have two FMULS instructions in a row, where one argument of the second depends on the result of the first?

Execution will stall until the result of the first `FMULS` is available. See section 21.6 "Operation of the scoreboards" for more detail.
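The stall can be illustrated with a toy scoreboard. This is only a sketch under simplifying assumptions, not a model of the real hardware: instructions issue in order, at most one per cycle, and an instruction waits until every source register it reads has been written. Destination-register hazards and short-vector iteration are ignored. The helper `issue_cycles` and the `LATENCY` table are hypothetical names; the latency of 8 for `FMULS` is the figure from the question.

```python
def issue_cycles(program, latency):
    """Return the cycle at which each instruction issues.

    program: list of (opcode, dest, sources) tuples in program order.
    latency: dict mapping opcode -> result latency in cycles.
    Toy model: in-order issue, one instruction per cycle at most, and an
    instruction stalls until all of its source registers are ready.
    """
    ready_at = {}  # register -> cycle at which its pending result is ready
    cycle = 0      # earliest cycle at which the next instruction may issue
    issued = []
    for opcode, dest, sources in program:
        # Wait for the slowest pending source, if any.
        start = max([cycle] + [ready_at.get(reg, 0) for reg in sources])
        issued.append(start)
        ready_at[dest] = start + latency[opcode]
        cycle = start + 1
    return issued


LATENCY = {"FMULS": 8}  # assumed from the figures quoted in the question

independent = [("FMULS", "s8", ["s16", "s20"]),
               ("FMULS", "s12", ["s21", "s25"])]
dependent = [("FMULS", "s8", ["s16", "s20"]),
             ("FMULS", "s12", ["s21", "s8"])]

print(issue_cycles(independent, LATENCY))  # [0, 1]
print(issue_cycles(dependent, LATENCY))    # [0, 8]
```

In this model the independent pair issues back to back, while the dependent `FMULS` is held until cycle 8, when `s8` becomes available. The exact stall count on real silicon should be taken from the timing tables in the reference manual rather than from this sketch.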

What if we are in vector mode with 4 elements and, on the second FMULS instruction, all input registers but one are available? What will happen?

It will stall. Same reference.

Sqrt and division: will a sqrt or division operation prevent any subsequent operation from being started for 19 cycles?

No. See section 21.10 "Parallel Execution". An example is given in Table 21-15, in which a non-dependent `FADDS` executes immediately following `FDIVS`.

Note that it can be a bit of a challenge (though not impossible) to write short-vector VFP code that performs substantially faster than scalar code for many types of computation. Even if you learn how to do it, it will be of questionable value since the NEON unit seems to be the new model for vector computation on ARM. You may be better served in the long run by ignoring the short-vector operation for now and focusing on learning NEON for the future.