If anyone is interested in real numbers for the fixed code:
---------------------------------------------------------------------
Benchmark                            Time             CPU  Iterations
---------------------------------------------------------------------
BM_DirectCopy/threads:1         100234 ns       102534 ns        7467
BM_AVX2/threads:1               127413 ns       125558 ns        5600
BM_MemcpyChunked/threads:1      123646 ns       122768 ns        5600
BM_Memcpy/threads:1              92502 ns        87891 ns        6400
I ran it on an Intel machine, so I replaced the NEON code with an AVX2 equivalent.
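
For anyone wondering what an "AVX2 equivalent" of a NEON copy loop can look like, here is a minimal sketch. It is not the code behind the numbers above, since that code is not shown here. The function names `copy_avx2` and `copy_avx2_impl` are hypothetical, and the sketch assumes GCC or Clang (it relies on `__attribute__((target("avx2")))` and `__builtin_cpu_supports`). It moves 32 bytes per iteration with unaligned loads and stores, and falls back to `std::memcpy` for the tail and on CPUs without AVX2.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define SKETCH_HAS_X86 1
#endif

#ifdef SKETCH_HAS_X86
// Hypothetical helper: the target attribute lets this compile without -mavx2.
__attribute__((target("avx2")))
static void copy_avx2_impl(std::uint8_t* dst, const std::uint8_t* src,
                           std::size_t bytes) {
    std::size_t i = 0;
    // Copy one 256-bit (32-byte) vector per iteration, no alignment required.
    for (; i + 32 <= bytes; i += 32) {
        __m256i v = _mm256_loadu_si256(
            reinterpret_cast<const __m256i*>(src + i));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst + i), v);
    }
    // Copy the remaining 0..31 bytes.
    std::memcpy(dst + i, src + i, bytes - i);
}
#endif

// Hypothetical entry point: uses AVX2 when the CPU reports it, else memcpy.
void copy_avx2(void* dst, const void* src, std::size_t bytes) {
#ifdef SKETCH_HAS_X86
    if (__builtin_cpu_supports("avx2")) {
        copy_avx2_impl(static_cast<std::uint8_t*>(dst),
                       static_cast<const std::uint8_t*>(src), bytes);
        return;
    }
#endif
    std::memcpy(dst, src, bytes);
}
```

A hand-written loop like this has to compete with a library `memcpy` that is usually already tuned for the target CPU, which fits with BM_Memcpy being the fastest row in the table.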