Why is there such a large performance gap between the native C++ execution and the C# P/Invoke call?
Generally, C++ outperforms C# for SIMD workloads, and that alone explains much of the time delta here. In fact, the two compilers do not treat the same snippet of code the same way:
C# JIT ASM AVX:
xor edx,edx ; initialise edx (loop counter i) to zero
; LOOP_START
mov ecx,dword ptr [rsi+8] ; load vx.Length into ecx
cmp edx,ecx ; if i >= vx.Length
jae 000007FE95B958E7 ; throw IndexOutOfRangeException
lea r8d,[rdx+3] ; load i+3 into r8d
cmp r8d,ecx ; if i+3 >= vx.Length
jae 000007FE95B958E7 ; throw IndexOutOfRangeException
movups xmm0,xmmword ptr [rsi+rdx*4+10h] ; load vx[i..i+3] into xmm0
mov ecx,dword ptr [rdi+8] ; load vy.Length into ecx
cmp edx,ecx ; if i >= vy.Length
jae 000007FE95B958E7 ; throw IndexOutOfRangeException
cmp r8d,ecx ; if i+3 >= vy.Length
jae 000007FE95B958E7 ; throw IndexOutOfRangeException
movups xmm1,xmmword ptr [rdi+rdx*4+10h] ; load vy[i..i+3] into xmm1
paddd xmm0,xmm1 ; perform SIMD addition of xmm0 and xmm1
mov ecx,dword ptr [rax+8] ; load result.Length into ecx
cmp edx,ecx ; if i >= result.Length
jae 000007FE95B958EC ; throw ArgumentException
cmp r8d,ecx ; if i+3 >= result.Length
jae 000007FE95B958F1 ; throw ArgumentException
movups xmmword ptr [rax+rdx*4+10h],xmm0 ; move result out of xmm0 into the result array
add edx,4 ; increment loop counter, i, by 4
cmp edx,3E8h ; if i < 1000 (0x3E8)
jl 000007FE95B9589A ; go back to LOOP_START
C++ MSVC2015 AVX2:
; array initialisation and loop setup omitted...
; SIMD_LOOP_START
vmovdqu ymm1,ymmword ptr [rax-20h] ; load 8 ints (256 bits) from x into 256-bit register ymm1
vpaddd ymm1,ymm1,ymmword ptr [rcx+rax-20h] ; add 8 ints from y to those in ymm1 and store result back in ymm1
vmovdqu ymmword ptr [r8+rax-20h],ymm1 ; move result out of ymm1 into the result array
vmovdqu ymm2,ymmword ptr [rax] ; load the next 8 ints from x into ymm2
vpaddd ymm1,ymm2,ymmword ptr [rcx+rax] ; add the next 8 ints from y to those in ymm2 and store the result in ymm1
vmovdqu ymmword ptr [r8+rax],ymm1 ; move the result out of ymm1 into the result array
lea rax,[rax+40h] ; increment the array indexer by 16 ints (64 bytes)
sub r9,1 ; decrement the loop counter
jne main+120h ; if loop counter != 0 go back to SIMD_LOOP_START
; SIMPLE_LOOP_START
mov ecx,dword ptr [rbx+rax] ; load one int from x into ecx
add ecx,dword ptr [rax] ; add one int from y to the value in ecx and store the result in ecx
mov dword ptr [rdx+rax],ecx ; move the result out of ecx into the result array
lea rax,[rax+4] ; increment the array indexer by one int (4 bytes)
sub rdi,1 ; decrement the loop counter
jne main+160h ; if loop counter != 0 go back to SIMPLE_LOOP_START
This leads to the conclusion that the C++ compiler is able to auto-vectorize the loop when profitable, which saves a lot of execution time.
What can I do to bring the C#-called version closer to the performance of the native C++?
The main thing to notice is that vectorized code will always be faster than scalar code; you can expect roughly a 1.9x to 3.5x speedup when processing byte buffers with SIMD. You are using std::vector<uint8_t> image(width * height) in C++ but byte[] image = new byte[width * height]; in C#, which can have an effect. Vectorization saves time because a single AVX2 instruction can operate on 8 or 16 elements per clock cycle in parallel, whereas with scalar code the processor executes one instruction per data element in sequence.
How do libraries like OpenCvSharp achieve excellent performance with P/Invoke?
OpenCvSharp largely avoids byte[] by using Mat objects that wrap the native memory pointer directly, which minimizes marshaling.
Conclusion
I would highly recommend vectorized processing over scalar loops to save time. Keep in mind that you can use alternative, "raw-er" memory storage with memory pools and raw pointers, but to keep it simple and stupid (KISS) you can stay with vectors. Note that C++ will always be somewhat faster than C#, but you can get close.