SIMD code should run faster than plain scalar C code. Could you try the suggestions below?
Try unrolling the for loop in msa_memcpy_test. The whole point of introducing SIMD is to process several elements per instruction, so the loop runs fewer iterations and its overhead shrinks. I am not sure how much unrolling will help, though, given that SIMD instructions typically have a CPI of about 1. You could use #pragma unroll for this (GCC spells it #pragma GCC unroll N).
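As a rough illustration of what unrolling does, here is a minimal scalar sketch. The function copy_words_unrolled is a hypothetical stand-in for the loop in msa_memcpy_test (which I have not seen in full), and the unroll factor of 4 is an arbitrary choice, not a tuned value.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the copy loop in msa_memcpy_test:
   copies n 32-bit words with the loop body unrolled four times. */
void copy_words_unrolled(int32_t *dest, const int32_t *src, size_t n)
{
    size_t i = 0;

    /* Main loop: four elements per iteration, so the counter update
       and the branch execute a quarter as often. */
    for (; i + 4 <= n; i += 4) {
        dest[i]     = src[i];
        dest[i + 1] = src[i + 1];
        dest[i + 2] = src[i + 2];
        dest[i + 3] = src[i + 3];
    }

    /* Tail loop: the 0 to 3 elements left when n is not a multiple of 4. */
    for (; i < n; i++) {
        dest[i] = src[i];
    }
}
```

With the pragma, the compiler generates the same shape for you from the original one-element loop. Whether either version is measurably faster depends on the target and the optimization level, so it is worth timing both.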
Try aligning the addresses of src and dest. Unaligned memory addresses can hurt SIMD performance. You'll have to do this where you call malloc in main(). There are functions that do it for you (for example posix_memalign or C11 aligned_alloc), or you could use some pointer arithmetic. See https://tabreztalks.medium.com/memory-aligned-malloc-6c7b562d58d0
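A minimal sketch of the library route, assuming a C11 toolchain that provides aligned_alloc and assuming 16-byte alignment is what you want for 128-bit MSA vectors. The names VEC_ALIGN, alloc_vec_aligned and is_vec_aligned are made up for this example.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumption: 128-bit vectors, hence 16-byte alignment. */
#define VEC_ALIGN 16

/* Returns a block of at least `bytes` bytes that starts on a VEC_ALIGN
   boundary, or NULL on failure. Release it with free(). */
void *alloc_vec_aligned(size_t bytes)
{
    /* C11 aligned_alloc expects the size to be a multiple of the
       alignment, so round the request up. */
    size_t rounded = (bytes + VEC_ALIGN - 1) / VEC_ALIGN * VEC_ALIGN;
    if (rounded == 0) {
        rounded = VEC_ALIGN;
    }
    return aligned_alloc(VEC_ALIGN, rounded);
}

/* Returns 1 if p sits on a VEC_ALIGN boundary, 0 otherwise. */
int is_vec_aligned(const void *p)
{
    return ((uintptr_t)p % VEC_ALIGN) == 0;
}
```

In main() you would replace each malloc(size) for src and dest with alloc_vec_aligned(size). The is_vec_aligned helper is only there so you can check that the pointers you benchmark with really are aligned.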
Do you really need that __builtin_msa_ld_w? You could cast src to v4i32 *, dereference it, and pass the result directly to __builtin_msa_st_w.
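Here is a sketch of the cast idea, with some assumptions spelled out. vec4i32 is a hypothetical type that mirrors v4i32 from msa.h but adds may_alias so that overlaying it on an int array is well defined. The function assumes both pointers are 16-byte aligned and n is a multiple of 4. I have only been able to reason about the MSA branch, not run it, so treat that part as untested. The other branch is a plain GCC/Clang vector store so the same code also builds on other targets.

```c
#include <stddef.h>

/* Hypothetical helper type: four 32-bit ints in one 128-bit vector.
   may_alias permits reading plain int arrays through this type. */
typedef int vec4i32 __attribute__((vector_size(16), may_alias));

/* Copies n ints from src to dest, four at a time.
   Assumes n is a multiple of 4 and both pointers are 16-byte aligned. */
void copy_by_cast(int *dest, const int *src, size_t n)
{
    const vec4i32 *vsrc = (const vec4i32 *)src;

    for (size_t i = 0; i < n / 4; i++) {
#if defined(__mips_msa)
        /* Dereferencing vsrc replaces the explicit __builtin_msa_ld_w;
           the store builtin takes the vector, a base address and a
           constant byte offset. */
        __builtin_msa_st_w(vsrc[i], dest + 4 * i, 0);
#else
        /* Portable fallback: an ordinary vector assignment. */
        ((vec4i32 *)dest)[i] = vsrc[i];
#endif
    }
}
```

The compiler will most likely emit the same load instruction either way, so this is more about simpler source than a guaranteed speedup. Note that the dereference makes the alignment requirement from the previous suggestion mandatory rather than just a performance matter.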