With the help of Anthropic I found the issue. In the first kernel I declared the swap space as DenseXY, while in the second kernel the 3D matrix was declared as DenseZY. I did not think this could make any difference beyond the number of cache misses I might encounter. In fact, once I change all the declarations to DenseXY, it compiles and runs.
By the way, for the sake of good order, I also learned that the stride names mean the opposite of what my intuition suggested:
Stride3D.DenseXY:
Memory order: X → Y → Z (X changes fastest, Z changes slowest).
This matches a C-style array[z][y][x]: consecutive X elements are adjacent in memory.
Memory layout, with indices written as [z,y,x]: [0,0,0], [0,0,1], [0,0,2], ..., [0,1,0], [0,1,1], ..., [1,0,0]
Stride3D.DenseZY:
Memory order: Z → Y → X (Z changes fastest, X changes slowest).
This matches a C-style array[x][y][z]: consecutive Z elements are adjacent in memory.
Memory layout, with indices written as [z,y,x]: [0,0,0], [1,0,0], [2,0,0], ..., [0,1,0], [1,1,0], ..., [0,0,1]
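
To make the two orderings concrete, here is a small language-neutral sketch in Python. It is not ILGPU code: the helper names (dense_xy_offset, dense_zy_offset, memory_order) are made up for illustration, and it assumes a tightly packed buffer with no padding between rows or slices. It only computes where each (x, y, z) element would land under each layout.

```python
from itertools import product


def dense_xy_offset(x, y, z, dims):
    """Linear offset when X is contiguous (stride 1), then Y, then Z."""
    dim_x, dim_y, _ = dims
    return x + dim_x * (y + dim_y * z)


def dense_zy_offset(x, y, z, dims):
    """Linear offset when Z is contiguous (stride 1), then Y, then X."""
    _, dim_y, dim_z = dims
    return z + dim_z * (y + dim_y * x)


def memory_order(offset_of, dims):
    """Return every index as a (z, y, x) tuple, sorted by memory offset."""
    dim_x, dim_y, dim_z = dims
    cells = product(range(dim_x), range(dim_y), range(dim_z))
    ordered = sorted(cells, key=lambda c: offset_of(*c, dims))
    return [(z, y, x) for (x, y, z) in ordered]


if __name__ == "__main__":
    dims = (2, 2, 2)  # (dim_x, dim_y, dim_z), hypothetical sizes
    print("DenseXY:", memory_order(dense_xy_offset, dims))
    print("DenseZY:", memory_order(dense_zy_offset, dims))
```

The same element therefore sits at a different offset depending on the layout, so a buffer filled under one layout and read under the other would return the wrong elements even if the mismatch were accepted.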