The question as posed is actually unrelated to the real cause of this behavior.
In the setup, we multiply single-precision matrices on both the GPU and the CPU and test by comparing the two outputs. While this seems reasonable, the CPU and GPU are different machines: they evaluate the same dot products in a different order (and potentially with different intermediate precision, e.g. FMA contraction), so their floating-point rounding differs. The discrepancy grows with the input matrix size because each output element is a sum over an entire row and column, and rounding error accumulates with the number of terms being summed. The sketch below illustrates the order-dependence.
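Here is a minimal, self-contained sketch (plain C++, no GPU required) of this effect: the same set of single-precision numbers summed sequentially versus in a blocked, tree-like order, which loosely models how GPU threads combine partial sums. The variable names and sizes are illustrative, not taken from the original test code.

```cpp
#include <cstdio>
#include <vector>
#include <random>

int main() {
    const int n = 1 << 20;  // a long sum, like the inner loop of a big matmul
    std::vector<float> x(n);
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    for (auto& v : x) v = dist(rng);

    // Sequential accumulation, as a naive CPU loop would do it.
    float seq = 0.0f;
    for (int i = 0; i < n; ++i) seq += x[i];

    // Pairwise (tree) accumulation, closer to how GPU threads combine
    // partial sums. Mathematically identical, numerically different.
    std::vector<float> partial(x);
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = 0; i + stride < n; i += 2 * stride)
            partial[i] += partial[i + stride];

    printf("sequential: %.6f\npairwise:   %.6f\ndiff:       %g\n",
           seq, partial[0], seq - partial[0]);
    return 0;
}
```

The two results are both "correct" to within float precision, yet they are not bitwise equal, which is exactly why an exact-equality test between CPU and GPU outputs fails.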
This also explains why reducing the block size made a test case pass: with the reduced block size, the errors produced happened to stay within the accepted bounds. But this still does not work for bigger inputs, because computing one output element requires a single thread to scan an entire row and an entire column, each of which scales with the input size, so the accumulated error grows regardless of block size. A comparison tolerance that scales with the inner dimension accounts for this, as sketched below.
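A hedged sketch of such a comparison: the error budget grows with the inner dimension `k`, since each output element is a dot product over `k` terms. The bound used here (`k * epsilon * |reference|`) is a rough heuristic rather than a rigorous error analysis, and the function and parameter names are illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Returns true if the GPU result is within a rounding-error budget of the
// CPU reference. k is the inner dimension of the matrix multiplication.
bool nearly_equal(float gpu, float cpu, int k) {
    float tol = k * std::numeric_limits<float>::epsilon()
                  * std::max(std::fabs(cpu), 1.0f);
    return std::fabs(gpu - cpu) <= tol;
}
```

With a size-aware tolerance like this, the test no longer becomes stricter (in relative terms) as the matrices grow.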
MAX_VALUE: This behavior is still not obvious to me. None of the computations exceed the floating-point range, so I am not sure how to explain it.