Newer ARM processors include the PLD and PLI instructions.
I'm writing tight inner loops (in C++) which have a non-sequential memory access pattern, but a pattern that naturally my code fully understands. I would anticipate a substantial speedup if I could prefetch the next location whilst processing the current memory location, and I would expect this to be quick-enough to try out to be worth the experiment!
I'm using new expensive compilers from ARM, and it doesn't seem to be including PLD instructions anywhere, let alone in this particular loop that I care about.
How can I include explicit prefetch instructions in my C++ code?