SIMD should be faster than standard C code. Could you try the suggestions below?
Try unrolling the for loop in msa_memcpy_test. The whole point of introducing SIMD is to execute code in a vector fashion, so that fewer loop iterations are needed and the loop overhead shrinks. I am not sure how much it will help, though, given that SIMD instructions have ~1 CPI. You could use #pragma unroll for this (GCC spells it #pragma GCC unroll N, available from GCC 8).
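To make the unrolling suggestion concrete, here is a minimal scalar sketch. The function name copy_words_unrolled and the unroll factor of 4 are my own choices for illustration, not taken from your code. In msa_memcpy_test each statement in the main loop would presumably be one vector load/store pair rather than a single word copy.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copies `count` 32-bit words, four per loop trip. */
void copy_words_unrolled(int32_t *dest, const int32_t *src, size_t count)
{
    size_t i = 0;

    /* Main loop: four independent copies per iteration, so the counter
     * update and the branch are paid once per four elements. */
    for (; i + 4 <= count; i += 4) {
        dest[i]     = src[i];
        dest[i + 1] = src[i + 1];
        dest[i + 2] = src[i + 2];
        dest[i + 3] = src[i + 3];
    }

    /* Tail loop: handles the 0 to 3 elements left over when `count`
     * is not a multiple of four. */
    for (; i < count; i++) {
        dest[i] = src[i];
    }
}
```

Whether this beats what the compiler already does at -O2 or -O3 is something you would have to measure; the compiler may well unroll the simple loop on its own.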
Try aligning the addresses of src and dest. Unaligned memory addresses can hurt SIMD performance. You'll have to do this where you call malloc in main(). There are functions that do it for you (for example posix_memalign or C11 aligned_alloc), or you could use some pointer arithmetic. See this - https://tabreztalks.medium.com/memory-aligned-malloc-6c7b562d58d0
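A minimal sketch of the aligned allocation, assuming a 16-byte alignment (MSA vector registers are 128 bits wide) and a C11 library that provides aligned_alloc. The names VEC_ALIGN, alloc_vec_aligned and is_vec_aligned are hypothetical helpers, not part of any library.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: 16 bytes, matching a 128-bit vector register. */
#define VEC_ALIGN ((size_t)16)

/* Returns nonzero if `p` sits on a VEC_ALIGN-byte boundary. */
int is_vec_aligned(const void *p)
{
    return ((uintptr_t)p % VEC_ALIGN) == 0;
}

/* Hypothetical helper: allocates at least `nbytes` on a VEC_ALIGN boundary.
 * C11 aligned_alloc expects the size to be a multiple of the alignment,
 * so the request is rounded up first. Release the result with free(). */
void *alloc_vec_aligned(size_t nbytes)
{
    if (nbytes > SIZE_MAX - (VEC_ALIGN - 1)) {
        return NULL; /* rounding up would overflow */
    }

    size_t rounded = (nbytes + VEC_ALIGN - 1) / VEC_ALIGN * VEC_ALIGN;
    if (rounded == 0) {
        rounded = VEC_ALIGN; /* avoid a zero-sized request */
    }

    void *p = aligned_alloc(VEC_ALIGN, rounded);
    assert(p == NULL || is_vec_aligned(p));
    return p;
}
```

In main() you would then replace each malloc(n) for src and dest with alloc_vec_aligned(n) and keep the existing free() calls. If your toolchain lacks aligned_alloc, posix_memalign is the usual alternative on POSIX systems.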
Do you really need that __builtin_msa_ld_w? You could cast src to v4i32 * and pass the dereferenced value directly to __builtin_msa_st_w.
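The cast idea can be sketched with GCC's generic vector extension, so it compiles on any GCC or Clang target rather than only on MIPS. The type vec4i32 is my stand-in for v4i32 from msa.h, and copy_vec4 is a hypothetical name. I am assuming both pointers are 16-byte aligned and that count is a multiple of four; the assertions check exactly that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for v4i32: four 32-bit ints in one 128-bit vector.
 * may_alias lets the vector type legally overlay int32_t storage. */
typedef int32_t vec4i32 __attribute__((vector_size(16), may_alias));

/* Hypothetical helper: copies `count` words, one whole vector at a time. */
void copy_vec4(int32_t *dest, const int32_t *src, size_t count)
{
    /* Dereferencing a vector pointer assumes vector alignment. */
    assert(((uintptr_t)src % 16) == 0);
    assert(((uintptr_t)dest % 16) == 0);
    assert(count % 4 == 0);

    const vec4i32 *vsrc = (const vec4i32 *)src;
    vec4i32 *vdest = (vec4i32 *)dest;

    for (size_t i = 0; i < count / 4; i++) {
        /* One vector load and one vector store, no explicit load builtin. */
        vdest[i] = vsrc[i];
    }
}
```

On MSA, the assignment in the loop would roughly correspond to passing vsrc[i] as the value argument of __builtin_msa_st_w, with vdest + i as the address and 0 as the offset. The compiler still has to emit a load for the dereference, so this mainly tidies the source; check the generated assembly before expecting a speedup.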