Fixed the issue by switching from F32 to F16. I was running it in Google Colab on a T4 GPU, and it looks like there is an issue with F32 support on the T4: https://github.com/triton-lang/triton/issues/5557
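For anyone hitting the same thing, here is a rough sketch of how the dtype could be picked automatically instead of hard-coding it. The helper names `pick_dtype` and `detect_capability` are hypothetical, and the rule "use F16 on anything below compute capability 8.0" is my own assumption based on the T4 being capability 7.5; the linked issue may describe a narrower condition.

```python
# Hypothetical helpers, not part of Triton or PyTorch.
# Assumption: the F32 problem affects GPUs below compute capability 8.0,
# such as the T4 (capability 7.5). Adjust the threshold if that is wrong.

def pick_dtype(capability):
    """Return a dtype name for a (major, minor) compute capability tuple."""
    major, _minor = capability
    return "float16" if major < 8 else "float32"


def detect_capability():
    """Return the current GPU's (major, minor) capability, or None if unavailable."""
    try:
        import torch  # optional dependency; only needed for detection
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()
```

In a Colab notebook with PyTorch installed, something like `getattr(torch, pick_dtype(detect_capability()))` would then give `torch.float16` on a T4, as long as `detect_capability()` did not return `None`.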