It's an alignment issue: int4/float4 require a different (stricter) alignment than int3/float3. In my example the output pointer is passed as the first argument, so the second argument starts at an offset of 4 bytes. That works for int3/float3, but a four-element vector placed there is misaligned and gets "cut in half": the kernel reads the last two elements plus two undefined ones.
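
To make the offset arithmetic concrete, here is a minimal host-side sketch in plain C++. It assumes the alignments CUDA documents for its built-in vector types (4 bytes for float3, 16 bytes for float4) and the 4-byte output pointer from the example above. Float3, Float4, align_up and second_argument_offset are hypothetical stand-ins written for illustration, not part of any API.

```cpp
#include <cstddef>

// Hypothetical stand-ins that mimic the assumed alignment of the
// built-in vector types: float3 -> 4 bytes, float4 -> 16 bytes.
struct Float3 { float x, y, z; };
struct alignas(16) Float4 { float x, y, z, w; };

// Round `offset` up to the next multiple of `alignment`
// (alignment must be a power of two).
constexpr std::size_t align_up(std::size_t offset, std::size_t alignment) {
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Offset at which the second argument has to be placed when the first
// argument occupies `first_arg_size` bytes starting at offset 0.
template <typename T>
constexpr std::size_t second_argument_offset(std::size_t first_arg_size) {
    return align_up(first_arg_size, alignof(T));
}

// Assumption: the output pointer is 4 bytes wide, as in the example above.
constexpr std::size_t kPointerSize = 4;

// A three-element vector can follow the pointer directly ...
static_assert(second_argument_offset<Float3>(kPointerSize) == 4,
              "float3-like argument fits at offset 4");
// ... but a four-element vector has to be pushed out to offset 16.
static_assert(second_argument_offset<Float4>(kPointerSize) == 16,
              "float4-like argument must start at offset 16");
```

Under these assumptions, writing the float4 at offset 4 while the kernel reads it from its aligned offset gives the kind of mismatch described above. Rounding each argument's offset up to the alignment of its type before writing it avoids the problem.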