I hit the same problem, and realized that a 3D FFT must internally also run an FFT along Y as one of its steps. It is a pity that this is not exposed in the cuFFT interface. Strangely, doing a batched 2D FFT over the two inner dimensions and then an inverse 1D FFT along the innermost dimension is often faster than an outer loop that launches the batched transforms along Y one slice at a time. But it is a bit less precise.
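
To show why the workaround gives the same answer, here is a minimal sketch in plain Python (not cuFFT code). It assumes the data is laid out as `a[z][y][x]` with X innermost, and it uses a naive O(n^2) DFT instead of a real FFT library. The helper names (`dft`, `transform_axis`, `fft_y_direct`, `fft_y_via_2d`) are made up for illustration. One difference to keep in mind: the inverse here is normalized by 1/n, whereas cuFFT transforms are unnormalized, so on the GPU the result would additionally need to be divided by the X length.

```python
import cmath

def dft(seq, inverse=False):
    """Naive O(n^2) DFT of a 1D sequence; the inverse is scaled by 1/n."""
    n = len(seq)
    sign = 1 if inverse else -1
    out = [
        sum(v * cmath.exp(sign * 2j * cmath.pi * j * k / n)
            for j, v in enumerate(seq))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out

def transform_axis(a, axis, inverse=False):
    """Apply dft along axis 1 (Y) or axis 2 (X) of a nested list a[z][y][x]."""
    nz, ny, nx = len(a), len(a[0]), len(a[0][0])
    out = [[[0j] * nx for _ in range(ny)] for _ in range(nz)]
    for z in range(nz):
        if axis == 2:
            for y in range(ny):
                out[z][y] = dft(a[z][y], inverse)
        elif axis == 1:
            for x in range(nx):
                col = dft([a[z][y][x] for y in range(ny)], inverse)
                for y in range(ny):
                    out[z][y][x] = col[y]
        else:
            raise ValueError("only axes 1 (Y) and 2 (X) are supported")
    return out

def fft_y_direct(a):
    """Reference: transform along Y only."""
    return transform_axis(a, 1)

def fft_y_via_2d(a):
    """Workaround: 2D transform over (Y, X), then undo the X part."""
    two_d = transform_axis(transform_axis(a, 2), 1)   # 2D = X pass, then Y pass
    return transform_axis(two_d, 2, inverse=True)     # inverse along X only

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two volumes."""
    return max(
        abs(a[z][y][x] - b[z][y][x])
        for z in range(len(a))
        for y in range(len(a[0]))
        for x in range(len(a[0][0]))
    )

# Small deterministic test volume: 2 x 3 x 4 (Z, Y, X), arbitrary values.
volume = [
    [[complex((3 * z + 5 * y + 7 * x) % 11, (z * y + x) % 4) for x in range(4)]
     for y in range(3)]
    for z in range(2)
]

direct = fft_y_direct(volume)
via_2d = fft_y_via_2d(volume)
print("max difference:", max_abs_diff(direct, via_2d))
```

The two results agree up to rounding error. The extra forward and inverse pass along X is additional floating-point work, which would be consistent with the slightly lower precision mentioned above, although I have not verified that this is the only cause on the GPU.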