79321252

Date: 2025-01-01 06:47:28
Score: 1
Natty:
Report link

SIMD code should generally run faster than standard C code. Could you try the suggestions below?

  1. Try unrolling the for loop in msa_memcpy_test. The whole point of SIMD is to process several elements per instruction, which cuts the number of loop iterations and the per-iteration overhead. I am not sure how much it will help, though, given that SIMD instructions have a CPI of about 1. You could use an unroll pragma for this (#pragma unroll in Clang, #pragma GCC unroll N in GCC).

  2. Try aligning the src and dest addresses. Unaligned memory addresses can hurt SIMD performance. You'll have to do this where you call malloc in main(). There are functions that do it for you (for example posix_memalign or C11 aligned_alloc), or you could use some pointer arithmetic. See https://tabreztalks.medium.com/memory-aligned-malloc-6c7b562d58d0

  3. Do you really need that __builtin_msa_ld_w? You could cast src to v4i32 * and pass the dereferenced value directly to __builtin_msa_st_w, which takes the vector by value as its first argument.
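The three suggestions above could be combined roughly as follows. This is only a sketch under assumptions, not your actual msa_memcpy_test: it uses the GCC/Clang vector extension instead of the MSA builtins so that it builds on any target, and the names vec4i32, alloc_aligned_i32 and vec_copy_unrolled are hypothetical. The vec4i32 type plays the role that v4i32 from msa.h plays in your code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Four 32-bit ints in one 16-byte value (GCC/Clang vector extension).
   may_alias lets us read int32_t data through this type safely.
   Hypothetical name; assumed to correspond to v4i32 in msa.h. */
typedef int32_t vec4i32 __attribute__((vector_size(16), may_alias));

/* Hypothetical helper: allocate room for n ints on a 16-byte boundary. */
static int32_t *alloc_aligned_i32(size_t n)
{
    /* C11 aligned_alloc wants the size to be a multiple of the alignment,
       so round the byte count up to the next multiple of 16. */
    size_t bytes = (n * sizeof(int32_t) + 15u) & ~(size_t)15u;
    return aligned_alloc(16, bytes);
}

/* Hypothetical stand-in for the copy loop. Assumes both pointers are
   16-byte aligned and n is a multiple of 8 (two vectors per iteration). */
static void vec_copy_unrolled(int32_t *dest, const int32_t *src, size_t n)
{
    assert((uintptr_t)dest % 16 == 0 && (uintptr_t)src % 16 == 0);
    assert(n % 8 == 0);

    vec4i32 *d = (vec4i32 *)dest;
    const vec4i32 *s = (const vec4i32 *)src;

    /* Manually unrolled by two: each iteration moves 8 ints.
       Dereferencing the cast pointer loads a whole vector, so no
       separate load builtin is needed before the store. */
    for (size_t i = 0; i < n / 4; i += 2) {
        d[i] = s[i];
        d[i + 1] = s[i + 1];
    }
}
```

In main() you would get src and dest from alloc_aligned_i32 instead of malloc, call vec_copy_unrolled(dest, src, n), and release both buffers with free as usual. With MSA enabled, the assignment d[i] = s[i] would presumably become __builtin_msa_st_w(*(v4i32 *)(src + 4 * i), dest + 4 * i, 0), but I have not measured whether this beats your current version.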

Reasons:
  • Blacklisted phrase (0.5): medium.com
  • Long answer (-0.5):
  • Has code block (-0.5):
  • Contains question mark (0.5):
  • Low reputation (1):
Posted by: Rahul Shrotey