The performance problem most likely comes from the overhead cgo adds to every call from Go into C. That per-call cost can cancel out the gains from SIMD in the C code, especially if each call does only a small amount of work.
To see how much overhead cgo adds, benchmark a plain C function (without SIMD) called through cgo against the equivalent native Go code. Also check that the C code is compiled with optimizations such as -O3 and that your memory is aligned properly for the SIMD instructions you use.
If cgo overhead is still a problem, you can try parallelizing the work, or look for Go libraries that use SIMD directly and avoid cgo altogether.