79337730

Date: 2025-01-08 01:02:06
Score: 2
Natty:

My question is, is there any way to optimize the code further? I have tried many things but cannot make any improvement over the normal code.

I'll toss in some other lessons from a career doing this stuff:

• Sometimes the processor you are working on is just slow / narrow. Something on the order of a Raspberry Pi might have the NEON unit "double pumped", which is to say that while the register is indeed 128 bits wide, the ALUs that service it are only 64 bits wide, so every 128-bit vector instruction is cracked internally into 2 operations, like ancient MMX or 32-bit ARM processors. This is especially likely on low-power implementations that also offer instructions working on half-sized vectors. If I were getting a 2x speedup and that seemed to be the speed of light, I would start digging up documentation on the processor's implementation details to find out whether it has been double pumped (a quick experiment along these lines is sketched after this list). We might imagine that a particularly tiny, low-energy processor could even be quad pumped. Really, this is among the first questions I ask before taking on an optimization project for a new processor: where can I get the processor implementation details? Often the company doesn't just hand them out, but there may be third-party sources who have run experiments to figure them out, such as Agner Fog's excellent latency tables at agner.org.

• Not all instructions are equal. The ones that operate lanewise -- that is, where each element in argument 1 operates on the corresponding element of argument 2, not horizontally within the same vector -- tend to be more efficient. The ones that don't are often cracked into microcode and run internally as several instructions. We might imagine that a horizontal add like vaddv, for example, could be cracked into as many as 3 pairwise vpadd operations, so it might count double or triple the normal instruction, depending on the ALU width described above, not to mention inserting some long-latency pipeline bubbles. In general you should avoid instructions that operate horizontally (the second sketch after this list shows the usual fix), though there are exceptions where adjacent lanes combine to make a larger result, like vaddl and similar. The permute/shuffle instructions are usually another exception, though those can be cracked as well: 32-bit NEON implementations frequently have vtbl1 as a single instruction, but vtbl2/3/4 are cracked into several. On 64-bit, vtbl1q might be a single instruction, and maybe even vtbl2q depending on the maker, but the 3/4 variants probably not. Be suspicious of anything likely to be cracked in this way. An exception might be SVE, where this is part of the design and there is a mechanism to solve the slowdown on future hardware.

• You'll need to be aware of other machine hazards: a read after a write to the same address, or to a similar address on another page (sketched below); a possible ~10-cycle delay between the vector ALUs and the integer ALUs, with the vector side just running 10 cycles behind everything else; denormal faults; page faults; extra-long-latency instructions like division; etc. Usually these can be spotted by running an assembly-level sampler and looking for instructions that seem to take an unusually long time (have a lot of samples landing on them). Be aware that decoders processing multiple instructions per cycle may cause a repeating pattern of samples landing on every 4th instruction rather than being spread evenly around. If you see a lot of time going to one thing compared to everything else, either it's a loop and those instructions are getting hammered, or something quite wrong is happening there, and then you'll need to figure out what is causing it (cache-inhibited memory, anyone?) and see if there is a solution. Code that runs well has its time evenly distributed among the instructions in the trace.

• Some other replies have pointed out instruction latency as a problem. This certainly is a problem for in-order machines, but most of those are long gone. Something Raspberry-Pi grade, however, might still be in order. (I haven't bothered to pick up the platform.) Even so, the capabilities of smaller out-of-order hardware and older phones might be quite limited. When out-of-order execution is working well, we should expect loops to unroll into the out-of-order buffering mechanism, with possibly several hundred instructions in flight. In that case your latency is likely covered by instructions from other loop iterations hoisted between the instructions of the current one. It is possible to write code where each loop iteration depends on the last (like an IIR audio filter); then that doesn't work (see the last sketch below). Usually, however, a loop loads some data from memory, does stuff to it, and writes it back somewhere. For those, the loads of succeeding loop iterations will hopefully be hoisted past the stores of previous ones (assuming no false-aliasing stalls) and the arithmetic can fill all the pipeline bubbles.
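
To make the double-pumping question from the first bullet concrete, here is the kind of experiment I mean. This is a minimal sketch, assuming an AArch64 GCC/Clang toolchain with arm_neon.h; all the names are mine, and the empty asm is just a standard trick to stop the compiler from folding the loops away. It times equal numbers of independent 64-bit (d-form) and 128-bit (q-form) adds: if the q-form loop takes roughly twice as long, each 128-bit op is probably being cracked into two 64-bit halves.

    #include <arm_neon.h>
    #include <stdio.h>
    #include <time.h>

    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        const int iters = 100000000;

        uint32x2_t d0 = vdup_n_u32(1), d1 = vdup_n_u32(2),
                   d2 = vdup_n_u32(3), d3 = vdup_n_u32(4);
        double t = now();
        for (int i = 0; i < iters; i++) {
            /* four independent 64-bit adds per iteration: throughput-bound */
            d0 = vadd_u32(d0, d1); d1 = vadd_u32(d1, d2);
            d2 = vadd_u32(d2, d3); d3 = vadd_u32(d3, d0);
            asm volatile("" : "+w"(d0), "+w"(d1), "+w"(d2), "+w"(d3));
        }
        double d_time = now() - t;

        uint32x4_t q0 = vdupq_n_u32(1), q1 = vdupq_n_u32(2),
                   q2 = vdupq_n_u32(3), q3 = vdupq_n_u32(4);
        t = now();
        for (int i = 0; i < iters; i++) {
            /* the same dance with full 128-bit vectors */
            q0 = vaddq_u32(q0, q1); q1 = vaddq_u32(q1, q2);
            q2 = vaddq_u32(q2, q3); q3 = vaddq_u32(q3, q0);
            asm volatile("" : "+w"(q0), "+w"(q1), "+w"(q2), "+w"(q3));
        }
        double q_time = now() - t;

        /* ratio near 1: full-width unit; near 2: likely double pumped */
        printf("d: %.3fs  q: %.3fs  q/d: %.2f\n",
               d_time, q_time, q_time / d_time);
        return 0;
    }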
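
For the horizontal-instruction point in the second bullet, the usual fix looks like this: keep the loop lanewise and pay for exactly one horizontal reduction at the end. Again a minimal sketch assuming AArch64 NEON intrinsics and, for brevity, a length that is a multiple of 4; the function names are mine.

    #include <arm_neon.h>

    /* Slow shape: one across-vector add (vaddvq_f32) per loop iteration. */
    float sum_horizontal(const float *x, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i += 4)
            s += vaddvq_f32(vld1q_f32(x + i));  /* horizontal op in hot loop */
        return s;
    }

    /* Fast shape: lanewise adds in the loop, a single reduction at the end. */
    float sum_lanewise(const float *x, int n) {
        float32x4_t acc = vdupq_n_f32(0.0f);
        for (int i = 0; i < n; i += 4)
            acc = vaddq_f32(acc, vld1q_f32(x + i)); /* lanewise, pipelines well */
        return vaddvq_f32(acc);                     /* one horizontal add total */
    }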
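
Here is the "similar address on another page" hazard from the third bullet in miniature. Whether and how badly this bites is entirely implementation dependent -- memory-disambiguation hardware often compares only the low address bits, so a load and a pending store 4096 bytes apart can look like the same address and get falsely serialized. The layout below is a hypothetical worst case of mine, not something from the original question.

    #include <stdlib.h>
    #include <string.h>

    /* src[i] and dst[i] differ by exactly 4096 bytes, so their low 12
       address bits match; n is assumed <= 4096 so the regions do not
       actually overlap, yet partial-address disambiguation may still
       stall each load behind the previous store. */
    void copy_page_apart(int n) {
        char *buf = malloc(4096 + (size_t)n);
        if (!buf) return;
        memset(buf, 0, 4096 + (size_t)n);
        char *src = buf;
        char *dst = buf + 4096;
        for (int i = 0; i < n; i++)
            dst[i] = (char)(src[i] + 1);  /* store shadows the next load */
        free(buf);
    }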
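
Finally, the dependent-iteration case from the last bullet, side by side with the friendly case. In the first loop the iterations are independent, so an out-of-order core can overlap dozens of them and the multiply latency disappears; in the IIR-style loop every iteration must wait for the previous result, so you pay the full multiply-add latency per sample no matter how wide the machine is. Plain C, names mine:

    /* Iterations independent: out-of-order hardware hides the latency. */
    void scale(float *y, const float *x, float a, int n) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i];
    }

    /* Loop-carried dependence: each multiply-add must wait for the
       previous one to finish, exposing the full instruction latency. */
    void iir(float *y, const float *x, float a, int n) {
        y[0] = x[0];
        for (int i = 1; i < n; i++)
            y[i] = a * y[i - 1] + x[i];
    }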

Unfortunately, at the end of the day, it is unusual that people can just sit down, look at the code, and say definitively, "There's your problem!" There are some obvious things like division and cache misses, but much of the rest is situation dependent and may be hidden by a higher-level language. I find the best approach is to look at the assembly-level sampler output and read the tea leaves from there. Once you know the problem, the optimization will suggest itself. Don't just optimize because you think you know the answer. Measure.

Reasons:
  • Blacklisted phrase (1): is there any
  • Long answer (-1):
  • No code block (0.5):
  • Contains question mark (0.5):
  • Low reputation (1):
Posted by: Ian Ollmann