Why is there such a large performance gap between the native C++ execution and the C# P/Invoke call?
Generally, C++ outperforms C# for SIMD workloads, and that alone explains much of the time delta here. In fact, the two compilers do not treat the same snippet of code the same way:
C# JIT ASM AVX:
xor edx,edx ; initialise edx (loop counter i) to zero
; LOOP_START
mov ecx,dword ptr [rsi+8] ; load vx.Length into ecx
cmp edx,ecx ; if i >= vx.Length
jae 000007FE95B958E7 ; throw IndexOutOfRangeException
lea r8d,[rdx+3] ; load i+3 into r8d
cmp r8d,ecx ; if i+3 >= vx.Length
jae 000007FE95B958E7 ; throw IndexOutOfRangeException
movups xmm0,xmmword ptr [rsi+rdx*4+10h] ; load vx[i..i+3] into xmm0
mov ecx,dword ptr [rdi+8] ; load vy.Length into ecx
cmp edx,ecx ; if i >= vy.Length
jae 000007FE95B958E7 ; throw IndexOutOfRangeException
cmp r8d,ecx ; if i+3 >= vy.Length
jae 000007FE95B958E7 ; throw IndexOutOfRangeException
movups xmm1,xmmword ptr [rdi+rdx*4+10h] ; load vy[i..i+3] into xmm1
paddd xmm0,xmm1 ; perform SIMD addition of xmm0 and xmm1
mov ecx,dword ptr [rax+8] ; load result.Length into ecx
cmp edx,ecx ; if i >= result.Length
jae 000007FE95B958EC ; throw ArgumentException
cmp r8d,ecx ; if i+3 >= result.Length
jae 000007FE95B958F1 ; throw ArgumentException
movups xmmword ptr [rax+rdx*4+10h],xmm0 ; move result out of xmm0 into the result array
add edx,4 ; increment loop counter, i, by 4
cmp edx,3E8h ; if i < 1000 (0x3E8)
jl 000007FE95B9589A ; go back to LOOP_START
C++ MSVC2015 AVX2:
; array initialisation and loop setup omitted...
; SIMD_LOOP_START
vmovdqu ymm1,ymmword ptr [rax-20h] ; load 8 ints (256 bits) from x into 256-bit register ymm1
vpaddd ymm1,ymm1,ymmword ptr [rcx+rax-20h] ; add 8 ints from y to those in ymm1 and store result back in ymm1
vmovdqu ymmword ptr [r8+rax-20h],ymm1 ; move result out of ymm1 into the result array
vmovdqu ymm2,ymmword ptr [rax] ; load the next 8 ints from x into ymm2
vpaddd ymm1,ymm2,ymmword ptr [rcx+rax] ; add the next 8 ints from y to those in ymm2 and store the result in ymm1
vmovdqu ymmword ptr [r8+rax],ymm1 ; move the result out of ymm1 into the result array
lea rax,[rax+40h] ; increment the array indexer by 16 ints (64 bytes)
sub r9,1 ; decrement the loop counter
jne main+120h ; if loop counter != 0 go back to SIMD_LOOP_START
; SIMPLE_LOOP_START
mov ecx,dword ptr [rbx+rax] ; load one int from x into ecx
add ecx,dword ptr [rax] ; add one int from y to the value in ecx and store the result in ecx
mov dword ptr [rdx+rax],ecx ; move the result out of ecx into the result array
lea rax,[rax+4] ; increment the array indexer by one int (4 bytes)
sub rdi,1 ; decrement the loop counter
jne main+160h ; if loop counter != 0 go back to SIMPLE_LOOP_START
This leads to the conclusion that the C++ compiler is able to auto-vectorize the loop when profitable, which saves a lot of execution time.
What can I do to bring the C#-called version closer to the performance of the native C++?
The main thing to notice is that vectorized code will always be faster than scalar code; you can expect roughly a 1.9x to 3.5x speedup when processing byte buffers with SIMD. You are using std::vector<uint8_t> image(width * height) in C++ but byte[] image = new byte[width * height]; in C#, which can have an effect. Vectorization saves time because a single AVX2 instruction can operate on 8 or 16 elements per clock cycle in parallel, whereas with scalar code the processor executes one instruction per data element in sequence.
How do libraries like OpenCvSharp achieve excellent performance with P/Invoke?
OpenCvSharp largely avoids byte[] by using Mat objects that wrap the native memory pointer directly, which minimizes marshaling.
Conclusion
I would highly recommend vectorized processing over scalar loops to save time. Keep in mind that you can use alternative, "raw-er" memory storage with memory pools and raw pointers, but to keep it simple and stupid (KISS) you can stay with vectors. Note that C++ will always be somewhat faster than C#, but you can get close.